
fscrawler throws error when using flag --loop 1 #547

Closed
jeanp413 opened this issue Jul 4, 2018 · 6 comments · Fixed by #570
Labels: bug (For confirmed bugs)

jeanp413 commented Jul 4, 2018

Hi, I hope someone can help me with this error.
Every time I run fscrawler with --loop 1, I get the error Got a hard failure when executing the bulk request and the data doesn't get sent to Elasticsearch. If I run fscrawler without that option, everything works fine and I can see the data.

I'm using the following fscrawler 2.5 snapshot: fscrawler-2.5-20180215.233518-30.zip

00:40:53,904 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
00:40:53,907 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
00:40:53,907 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
00:40:53,907 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
00:40:53,907 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
00:40:53,908 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
00:40:53,911 DEBUG [f.p.e.c.f.c.FsCrawler] Cleaning existing status for job [documents]...
00:40:53,911 DEBUG [f.p.e.c.f.c.FsCrawler] Starting job [documents]...
00:40:55,167 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch >= 5, so we can use ingest node feature
00:40:55,264 WARN  [f.p.e.c.f.c.FsCrawler] We found old configuration index settings in [/root/fscrawlerconf] or [/root/fscrawlerconf/documents/_mappings]. You should look at the documentation about upgrades: https://github.com/dadoonet/fscrawler#upgrade-to-23
00:40:55,264 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
00:40:55,268 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] FS crawler connected to an elasticsearch [6.2.4] node.
00:40:55,268 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [jp_fscrawler]
00:40:55,289 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [jp_fscrawler_folders]
00:40:55,307 DEBUG [f.p.e.c.f.FsCrawlerImpl] creating fs crawler thread [documents] for [/root/documents] every [15m]
00:40:55,308 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [documents] for [/root/documents] every [15m]
00:40:55,308 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [documents] is now running. Run #1...
00:40:55,321 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/root/documents, /root/documents) = /
00:40:55,323 DEBUG [f.p.e.c.f.FsCrawlerImpl] Indexing jp_fscrawler_folders/doc/37b24843cf1efa280da2ab311183b10?pipeline=null
00:40:55,331 DEBUG [f.p.e.c.f.FsCrawlerImpl] indexing [/root/documents] content
00:40:55,331 DEBUG [f.p.e.c.f.c.FileAbstractor] Listing local files from /root/documents
00:40:55,337 DEBUG [f.p.e.c.f.c.FileAbstractor] 5 local files found
00:40:55,338 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [Chronology of Shakespeare's Plays.xlsx], includes = [null], excludes = [[~*]]
00:40:55,338 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [Chronology of Shakespeare's Plays.xlsx], excludes = [[~*]]
00:40:55,338 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [Chronology of Shakespeare's Plays.xlsx], includes = [null]
00:40:55,338 DEBUG [f.p.e.c.f.FsCrawlerImpl] [Chronology of Shakespeare's Plays.xlsx] can be indexed: [true]
00:40:55,338 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: Chronology of Shakespeare's Plays.xlsx
00:40:55,339 DEBUG [f.p.e.c.f.FsCrawlerImpl] fetching content from [/root/documents],[Chronology of Shakespeare's Plays.xlsx]
00:40:55,342 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/root/documents, /root/documents/Chronology of Shakespeare's Plays.xlsx) = /Chronology of Shakespeare's Plays.xlsx
00:40:55,376 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is disabled. Even though it's detected, it must be disabled explicitly
00:40:55,766 WARN  [o.a.t.p.PDFParser] JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

00:40:56,610 DEBUG [f.p.e.c.f.FsCrawlerImpl] Indexing jp_fscrawler/doc/4fb5512ebfa73c5574e35c69cd7d1ca7?pipeline=fscrawler-pipeline
00:40:56,613 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [Hamlet.pdf], includes = [null], excludes = [[~*]]
00:40:56,613 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [Hamlet.pdf], excludes = [[~*]]
00:40:56,613 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [Hamlet.pdf], includes = [null]
00:40:56,613 DEBUG [f.p.e.c.f.FsCrawlerImpl] [Hamlet.pdf] can be indexed: [true]
00:40:56,613 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: Hamlet.pdf
00:40:56,613 DEBUG [f.p.e.c.f.FsCrawlerImpl] fetching content from [/root/documents],[Hamlet.pdf]
00:40:56,615 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/root/documents, /root/documents/Hamlet.pdf) = /Hamlet.pdf
00:40:58,085 DEBUG [f.p.e.c.f.FsCrawlerImpl] Indexing jp_fscrawler/doc/e4384cdfa4b2093ec70810b0fa8e87?pipeline=fscrawler-pipeline
00:40:58,092 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [Galileo.ppt], includes = [null], excludes = [[~*]]
00:40:58,092 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [Galileo.ppt], excludes = [[~*]]
00:40:58,093 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [Galileo.ppt], includes = [null]
00:40:58,093 DEBUG [f.p.e.c.f.FsCrawlerImpl] [Galileo.ppt] can be indexed: [true]
00:40:58,093 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: Galileo.ppt
00:40:58,093 DEBUG [f.p.e.c.f.FsCrawlerImpl] fetching content from [/root/documents],[Galileo.ppt]
00:40:58,093 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/root/documents, /root/documents/Galileo.ppt) = /Galileo.ppt
00:40:58,734 DEBUG [f.p.e.c.f.FsCrawlerImpl] Indexing jp_fscrawler/doc/a386243762b7217b69e2d0323082312f?pipeline=fscrawler-pipeline
00:40:58,735 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [ang_ref_plato_the_republic_01.odt], includes = [null], excludes = [[~*]]
00:40:58,735 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [ang_ref_plato_the_republic_01.odt], excludes = [[~*]]
00:40:58,735 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [ang_ref_plato_the_republic_01.odt], includes = [null]
00:40:58,735 DEBUG [f.p.e.c.f.FsCrawlerImpl] [ang_ref_plato_the_republic_01.odt] can be indexed: [true]
00:40:58,735 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: ang_ref_plato_the_republic_01.odt
00:40:58,735 DEBUG [f.p.e.c.f.FsCrawlerImpl] fetching content from [/root/documents],[ang_ref_plato_the_republic_01.odt]
00:40:58,736 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/root/documents, /root/documents/ang_ref_plato_the_republic_01.odt) = /ang_ref_plato_the_republic_01.odt
00:40:58,832 DEBUG [f.p.e.c.f.FsCrawlerImpl] Indexing jp_fscrawler/doc/372088d22cf28d1896a8fc45596328?pipeline=fscrawler-pipeline
00:40:58,832 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [meditations-by-marcus-aurelius.doc], includes = [null], excludes = [[~*]]
00:40:58,832 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [meditations-by-marcus-aurelius.doc], excludes = [[~*]]
00:40:58,833 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [meditations-by-marcus-aurelius.doc], includes = [null]
00:40:58,833 DEBUG [f.p.e.c.f.FsCrawlerImpl] [meditations-by-marcus-aurelius.doc] can be indexed: [true]
00:40:58,833 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: meditations-by-marcus-aurelius.doc
00:40:58,833 DEBUG [f.p.e.c.f.FsCrawlerImpl] fetching content from [/root/documents],[meditations-by-marcus-aurelius.doc]
00:40:58,833 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/root/documents, /root/documents/meditations-by-marcus-aurelius.doc) = /meditations-by-marcus-aurelius.doc
00:40:58,934 DEBUG [f.p.e.c.f.FsCrawlerImpl] Indexing jp_fscrawler/doc/5f777d7c4ba4172ea20339f88f2442?pipeline=fscrawler-pipeline
00:40:58,935 DEBUG [f.p.e.c.f.FsCrawlerImpl] Looking for removed files in [/root/documents]...
00:40:59,072 DEBUG [f.p.e.c.f.FsCrawlerImpl] Looking for removed directories in [/root/documents]...
00:40:59,093 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler is stopping after 1 run
00:40:59,107 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [documents]
00:40:59,107 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
00:40:59,107 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
00:40:59,151 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
00:40:59,155 WARN  [f.p.e.c.f.c.ElasticsearchClientManager] Got a hard failure when executing the bulk request
org.apache.http.ConnectionClosedException: Connection closed unexpectedly
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.closed(HttpAsyncRequestExecutor.java:140) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.client.InternalIODispatch.onClosed(InternalIODispatch.java:71) [httpasyncclient-4.1.2.jar:4.1.2]
	at org.apache.http.impl.nio.client.InternalIODispatch.onClosed(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.disconnected(AbstractIODispatch.java:100) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionClosed(BaseIOReactor.java:279) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processClosedSessions(AbstractIOReactor.java:440) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:283) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [httpcore-nio-4.4.5.jar:4.4.5]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
00:40:59,167 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
00:40:59,167 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [documents] stopped
00:40:59,168 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [documents]
00:40:59,169 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
00:40:59,169 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
00:40:59,169 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
00:40:59,169 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
00:40:59,169 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [documents] stopped

dadoonet commented Jul 4, 2018

Thanks for reporting with all details!

dadoonet added the bug label on Jul 4, 2018
dadoonet self-assigned this on Jul 4, 2018
dadoonet added this to the 2.5 milestone on Jul 4, 2018
dadoonet commented
I had a hard time reproducing it, as it seems to be a race condition. Not sure yet how to fix it properly. Will try.

dadoonet added a commit that referenced this issue Jul 13, 2018
fscrawler throws error when using flag --loop 1
jeanp413 commented
It always throws the error for me. Some more info: I'm running it in an Ubuntu Docker container with openjdk-8-jdk.

  • Command:
    fscrawler --config_dir /root/fscrawlerconf documents --debug --loop 1 --restart
  • Config:
{
  "name" : "documents",
  "fs" : {
    "url" : "/root/documents",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : false,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : false,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "indexed_chars": "100%",
    "ocr" : {
      "language" : "spa"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "elasticsearch",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s",
    "index" : "jp_fscrawler",
    "index_folder" : "jp_fscrawler_folders",
    "pipeline" : "fscrawler-pipeline"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "0.0.0.0",
    "port" : 9500,
    "endpoint" : "fscrawler"
  }
}

dadoonet added a commit that referenced this issue Jul 16, 2018
fscrawler throws error when using flag --loop 1
dadoonet added a commit that referenced this issue Jul 16, 2018
fscrawler throws error when using flag --loop 1
dadoonet commented
I just ran it 3 times:

Config:

{
  "name" : "documents",
  "fs" : {
    "url" : "/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents"
  }
}

Command to run it:

$ ./fscrawler-2.5-SNAPSHOT/bin/fscrawler --config_dir ~/Documents/Elasticsearch/work/fscrawler/547/config documents --debug --loop 1 --restart

First run

10:30:46,364 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
10:30:46,367 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
10:30:46,367 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
10:30:46,368 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
10:30:46,368 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
10:30:46,368 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
10:30:46,372 DEBUG [f.p.e.c.f.c.FsCrawler] Cleaning existing status for job [documents]...
10:30:46,374 DEBUG [f.p.e.c.f.c.FsCrawler] Starting job [documents]...
10:30:47,071 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch >= 5, so we can use ingest node feature
10:30:47,122 WARN  [f.p.e.c.f.c.FsCrawler] We found old configuration index settings in [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/config] or [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/config/documents/_mappings]. You should look at the documentation about upgrades: https://github.com/dadoonet/fscrawler#upgrade-to-23
10:30:47,122 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
10:30:47,127 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] FS crawler connected to an elasticsearch [6.3.2] node.
10:30:47,128 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [documents]
10:30:47,470 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [documents_folder]
10:30:47,708 DEBUG [f.p.e.c.f.FsParser] creating fs crawler thread [documents] for [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents] every [15m]
10:30:47,709 INFO  [f.p.e.c.f.FsParser] FS crawler started for [documents] for [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents] every [15m]
10:30:47,709 DEBUG [f.p.e.c.f.FsParser] Fs crawler thread [documents] is now running. Run #1...
10:30:47,718 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents, /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents) = /
10:30:47,720 DEBUG [f.p.e.c.f.FsParser] Indexing documents_folder/doc/80d7e9f67615b3abb9157d946223a19?pipeline=null
10:30:47,725 DEBUG [f.p.e.c.f.FsParser] indexing [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents] content
10:30:47,725 DEBUG [f.p.e.c.f.c.FileAbstractor] Listing local files from /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents
10:30:47,730 DEBUG [f.p.e.c.f.c.FileAbstractor] 1 local files found
10:30:47,730 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents, /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents/foo.txt) = /foo.txt
10:30:47,731 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/foo.txt], includes = [null], excludes = [null]
10:30:47,731 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/foo.txt], excludes = [null]
10:30:47,731 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/foo.txt], includes = [null]
10:30:47,731 DEBUG [f.p.e.c.f.FsParser] [/foo.txt] can be indexed: [true]
10:30:47,731 DEBUG [f.p.e.c.f.FsParser]   - file: /foo.txt
10:30:47,731 DEBUG [f.p.e.c.f.FsParser] fetching content from [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents],[foo.txt]
10:30:47,734 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents, /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents/foo.txt) = /foo.txt
10:30:47,762 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated for PDF documents
10:30:48,161 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

10:30:48,285 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated
10:30:48,417 DEBUG [f.p.e.c.f.FsParser] Indexing documents/doc/49b9cbd1646961698bc35826c0775a23?pipeline=null
10:30:48,418 DEBUG [f.p.e.c.f.FsParser] Looking for removed files in [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents]...
10:30:48,543 DEBUG [f.p.e.c.f.FsParser] Looking for removed directories in [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents]...
10:30:48,568 INFO  [f.p.e.c.f.FsParser] FS crawler is stopping after 1 run
10:30:48,640 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [documents]
10:30:48,640 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
10:30:48,640 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
10:30:48,678 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
^C10:34:37,115 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [documents]
10:34:37,115 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
10:34:37,116 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
10:34:37,116 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
10:34:37,116 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
10:34:37,116 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [documents] stopped

BTW, that helped me discover another issue (which might be related): the REST client does not close. I had to type CTRL+C to stop it:

10:30:48,678 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
^C10:34:37,115 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [documents]

Second run

10:34:46,321 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
10:34:46,324 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
10:34:46,324 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
10:34:46,324 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
10:34:46,324 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
10:34:46,325 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
10:34:46,330 DEBUG [f.p.e.c.f.c.FsCrawler] Cleaning existing status for job [documents]...
10:34:46,333 DEBUG [f.p.e.c.f.c.FsCrawler] Starting job [documents]...
10:34:47,076 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch >= 5, so we can use ingest node feature
10:34:47,136 WARN  [f.p.e.c.f.c.FsCrawler] We found old configuration index settings in [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/config] or [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/config/documents/_mappings]. You should look at the documentation about upgrades: https://github.com/dadoonet/fscrawler#upgrade-to-23
10:34:47,136 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
10:34:47,144 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] FS crawler connected to an elasticsearch [6.3.2] node.
10:34:47,145 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [documents]
10:34:47,164 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [documents_folder]
10:34:47,174 DEBUG [f.p.e.c.f.FsParser] creating fs crawler thread [documents] for [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents] every [15m]
10:34:47,175 INFO  [f.p.e.c.f.FsParser] FS crawler started for [documents] for [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents] every [15m]
10:34:47,175 DEBUG [f.p.e.c.f.FsParser] Fs crawler thread [documents] is now running. Run #1...
10:34:47,187 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents, /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents) = /
10:34:47,191 DEBUG [f.p.e.c.f.FsParser] Indexing documents_folder/doc/80d7e9f67615b3abb9157d946223a19?pipeline=null
10:34:47,201 DEBUG [f.p.e.c.f.FsParser] indexing [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents] content
10:34:47,201 DEBUG [f.p.e.c.f.c.FileAbstractor] Listing local files from /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents
10:34:47,209 DEBUG [f.p.e.c.f.c.FileAbstractor] 1 local files found
10:34:47,209 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents, /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents/foo.txt) = /foo.txt
10:34:47,209 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/foo.txt], includes = [null], excludes = [null]
10:34:47,209 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/foo.txt], excludes = [null]
10:34:47,209 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/foo.txt], includes = [null]
10:34:47,209 DEBUG [f.p.e.c.f.FsParser] [/foo.txt] can be indexed: [true]
10:34:47,210 DEBUG [f.p.e.c.f.FsParser]   - file: /foo.txt
10:34:47,210 DEBUG [f.p.e.c.f.FsParser] fetching content from [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents],[foo.txt]
10:34:47,212 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents, /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents/foo.txt) = /foo.txt
10:34:47,242 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated for PDF documents
10:34:47,608 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

10:34:47,732 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated
10:34:47,860 DEBUG [f.p.e.c.f.FsParser] Indexing documents/doc/49b9cbd1646961698bc35826c0775a23?pipeline=null
10:34:47,861 DEBUG [f.p.e.c.f.FsParser] Looking for removed files in [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents]...
10:34:48,058 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents, /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents/foo.txt) = /foo.txt
10:34:48,059 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/foo.txt], includes = [null], excludes = [null]
10:34:48,062 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/foo.txt], excludes = [null]
10:34:48,062 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/foo.txt], includes = [null]
10:34:48,062 DEBUG [f.p.e.c.f.FsParser] Looking for removed directories in [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents]...
10:34:48,083 INFO  [f.p.e.c.f.FsParser] FS crawler is stopping after 1 run
10:34:48,113 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [documents]
10:34:48,113 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
10:34:48,113 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
10:34:48,135 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
10:34:48,142 WARN  [f.p.e.c.f.c.ElasticsearchClientManager] Got a hard failure when executing the bulk request
org.apache.http.ConnectionClosedException: Connection closed unexpectedly
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.closed(HttpAsyncRequestExecutor.java:140) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.client.InternalIODispatch.onClosed(InternalIODispatch.java:71) [httpasyncclient-4.1.2.jar:4.1.2]
	at org.apache.http.impl.nio.client.InternalIODispatch.onClosed(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.disconnected(AbstractIODispatch.java:100) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionClosed(BaseIOReactor.java:279) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processClosedSessions(AbstractIOReactor.java:440) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:283) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [httpcore-nio-4.4.5.jar:4.4.5]
	at java.lang.Thread.run(Thread.java:844) [?:?]
10:34:48,142 WARN  [f.p.e.c.f.c.ElasticsearchClientManager] Got a hard failure when executing the bulk request
org.apache.http.ConnectionClosedException: Connection closed unexpectedly
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.closed(HttpAsyncRequestExecutor.java:140) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.client.InternalIODispatch.onClosed(InternalIODispatch.java:71) [httpasyncclient-4.1.2.jar:4.1.2]
	at org.apache.http.impl.nio.client.InternalIODispatch.onClosed(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.disconnected(AbstractIODispatch.java:100) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionClosed(BaseIOReactor.java:279) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processClosedSessions(AbstractIOReactor.java:440) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:283) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.5.jar:4.4.5]
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [httpcore-nio-4.4.5.jar:4.4.5]
	at java.lang.Thread.run(Thread.java:844) [?:?]
10:34:48,154 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
10:34:48,165 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [documents] stopped
10:34:48,166 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [documents]
10:34:48,167 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
10:34:48,167 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
10:34:48,167 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
10:34:48,167 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
10:34:48,167 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [documents] stopped

This time I didn't have to CTRL+C, but it generated the error you are also getting.

Third run

10:34:53,994 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
10:34:53,996 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
10:34:53,997 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
10:34:53,997 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
10:34:53,998 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
10:34:53,998 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
10:34:54,002 DEBUG [f.p.e.c.f.c.FsCrawler] Cleaning existing status for job [documents]...
10:34:54,005 DEBUG [f.p.e.c.f.c.FsCrawler] Starting job [documents]...
10:34:54,709 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch >= 5, so we can use ingest node feature
10:34:54,763 WARN  [f.p.e.c.f.c.FsCrawler] We found old configuration index settings in [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/config] or [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/config/documents/_mappings]. You should look at the documentation about upgrades: https://github.com/dadoonet/fscrawler#upgrade-to-23
10:34:54,764 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
10:34:54,771 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] FS crawler connected to an elasticsearch [6.3.2] node.
10:34:54,772 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [documents]
10:34:54,789 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [documents_folder]
10:34:54,799 DEBUG [f.p.e.c.f.FsParser] creating fs crawler thread [documents] for [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents] every [15m]
10:34:54,800 INFO  [f.p.e.c.f.FsParser] FS crawler started for [documents] for [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents] every [15m]
10:34:54,801 DEBUG [f.p.e.c.f.FsParser] Fs crawler thread [documents] is now running. Run #1...
10:34:54,812 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents, /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents) = /
10:34:54,815 DEBUG [f.p.e.c.f.FsParser] Indexing documents_folder/doc/80d7e9f67615b3abb9157d946223a19?pipeline=null
10:34:54,822 DEBUG [f.p.e.c.f.FsParser] indexing [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents] content
10:34:54,822 DEBUG [f.p.e.c.f.c.FileAbstractor] Listing local files from /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents
10:34:54,830 DEBUG [f.p.e.c.f.c.FileAbstractor] 1 local files found
10:34:54,830 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents, /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents/foo.txt) = /foo.txt
10:34:54,830 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/foo.txt], includes = [null], excludes = [null]
10:34:54,830 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/foo.txt], excludes = [null]
10:34:54,830 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/foo.txt], includes = [null]
10:34:54,831 DEBUG [f.p.e.c.f.FsParser] [/foo.txt] can be indexed: [true]
10:34:54,831 DEBUG [f.p.e.c.f.FsParser]   - file: /foo.txt
10:34:54,831 DEBUG [f.p.e.c.f.FsParser] fetching content from [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents],[foo.txt]
10:34:54,834 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents, /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents/foo.txt) = /foo.txt
10:34:54,865 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated for PDF documents
10:34:55,245 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

10:34:55,375 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated
10:34:55,503 DEBUG [f.p.e.c.f.FsParser] Indexing documents/doc/49b9cbd1646961698bc35826c0775a23?pipeline=null
10:34:55,503 DEBUG [f.p.e.c.f.FsParser] Looking for removed files in [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents]...
10:34:55,682 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents, /Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents/foo.txt) = /foo.txt
10:34:55,682 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/foo.txt], includes = [null], excludes = [null]
10:34:55,685 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/foo.txt], excludes = [null]
10:34:55,685 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/foo.txt], includes = [null]
10:34:55,685 DEBUG [f.p.e.c.f.FsParser] Looking for removed directories in [/Users/dpilato/Documents/Elasticsearch/work/fscrawler/547/documents]...
10:34:55,708 INFO  [f.p.e.c.f.FsParser] FS crawler is stopping after 1 run
10:34:55,722 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [documents]
10:34:55,723 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
10:34:55,723 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
10:34:55,746 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
^C10:34:59,248 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [documents]
10:34:59,248 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
10:34:59,248 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
10:34:59,249 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
10:34:59,249 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
10:34:59,249 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [documents] stopped

It worked again, but I had to CTRL+C:

10:34:55,746 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
^C10:34:59,248 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [documents]

I'm going to try to force a call to BulkProcessor.flush(), which is supposed to happen when calling close()...

dadoonet commented
So I found a way to fix it. Instead of calling close(), I'm calling awaitClose(), which gives the bulk processors some time to close properly.

Going to push the fix soon.
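For anyone curious why close() vs awaitClose() matters here, this is a minimal, self-contained sketch of the shutdown race using a plain ExecutorService as a stand-in for the bulk processor (it is an analogy, not fscrawler's actual code; the class and method names are made up for illustration). An abrupt shutdown interrupts in-flight work, so a queued "bulk request" is lost, while a graceful shutdown followed by an awaitTermination (the awaitClose analogue) lets it drain first:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AwaitCloseSketch {

    // Simulates shutting down a "bulk processor" that has one pending request.
    // graceful=false mimics close(): interrupt immediately, pending work is lost.
    // graceful=true mimics awaitClose(30, SECONDS): wait for the flush to finish.
    static String drain(boolean graceful) throws InterruptedException {
        ExecutorService bulk = Executors.newSingleThreadExecutor();
        StringBuilder sent = new StringBuilder();
        bulk.submit(() -> {
            try {
                Thread.sleep(100);            // pretend the bulk flush takes a while
            } catch (InterruptedException e) {
                return;                       // interrupted: the request is dropped
            }
            sent.append("bulk flushed");
        });
        if (graceful) {
            bulk.shutdown();                  // stop accepting new work, keep draining
            bulk.awaitTermination(30, TimeUnit.SECONDS);
        } else {
            bulk.shutdownNow();               // interrupt in-flight work immediately
            bulk.awaitTermination(1, TimeUnit.SECONDS);
        }
        return sent.length() == 0 ? "request lost" : sent.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("close()-style:      " + drain(false));
        System.out.println("awaitClose()-style: " + drain(true));
    }
}
```

The fix dadoonet describes applies the same idea at the bulk-processor level: give pending bulk requests a bounded window to flush before tearing the client down.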

dadoonet added a commit that referenced this issue Jul 28, 2018
We were closing the bulk processors with `close()`, which tries to
exit the bulk processor immediately.

Calling `awaitClose(30, TimeUnit.SECONDS)` instead gives the bulk
processor 30 more seconds to flush all existing requests before
actually closing.

I thought `close()` did the same thing behind the scenes, but
apparently not.

Closes #547.
jeanp413 commented
@dadoonet Many thanks!
