
StackOverflow in completion suggester when using long input strings #3596

Closed
mattweber opened this Issue Aug 29, 2013 · 9 comments

@mattweber (Contributor) commented Aug 29, 2013

I ran into a StackOverflowError when using the new completion suggester on a long string. I reproduced this using the latest build of master and a random string:

[2013-08-29 10:10:36,691][DEBUG][action.index             ] [Summers, Christopher] [music][4], node[MKSIixv6QQaAu-ws3rs9_g], [P], s[STARTED]: Failed to execute [index {[music][song][10], source[{
    "name" : "long string",
    "suggest" : { 
        "input": [ "GBDGUPXWENKY8HVZDZ0ZB6K0DU7QHAUC68JYRKCIU1GOLNDLRQF693S9UNB0OZL0TXL25LVUGNW03KOBXZK050HSXD37YJAA2JTEFGOG2552C0C0C6Z2G8WCH5Y78KDBY9D0950ZRPSMV3ZC3CPAZGP2D4QJ4FJE1ZQIQUYN6CBMUE2DDB4UV71AH4MR5E2F86NQN67I0STDK7I413DR6N5LSG1OK6YICL09VRIIN2FMKWEP242HYDG3GITQ4DPAMUL5N4S8HB7W7H7C6B7ZFTGO2RRLYB56GE44XD6B42XZVBP8IJROUHTPU7SFI59D1T11XH9FD6KDX7AV84K4KWHK48BQI1VZNZQRN1VC6YM8KLS4PGNGLFHRN302J92U90XM8H2AB94AFPFDTG559FHRX7ANRDUBYLPCR5U9O268HU783MOLP1A8TSU7NJWM6RG7ANPZL9NPQD1TRB3RPJNCT5IJS7QI6MI1HJG86XU8BXRLRU81ITKEXR1X7H8EXAKM1ER9DFMAS7O2WJB4W26Z6ETZ3UZKB55LDZEBW9XEDE7PW6ITA0CHUWXTUHSVID0ID5TTGG6U9J2Z72OV0ROI5KNRGDHYMT8O1S2GWA5EZFJJEGWS0R3SB8WVBL7XPCTD5WX314AEM8ZYW2APIM1GZXV17VLJJ9CF4TMD87X3SH4U1UQVOWCR27N4ZXH0UZ0VC545GPWK9EKJ8HLBRKIBGUHAESMJ930RF6SPMYVH4R02E54NBTBLOQRH92H49TJFT9SG3VU4WGV392DLIGTJUOH80RTIIL86N675GVTNUX7Z89FH8CU2S5XNW0FD6TG3DFU3WCDCJN03D6TYESIAZW6J7CCMMUZTNF6ON07BOGLMLYZRTQZB4O8MYOWQGI1185ALZSTO3HQFZBOSMHXQ7OKTZWHDI6O5SB1NWDF7PVBFNSM11QEGQKI6E52DVG44H3M4OIOT9MDZYOUBFVHV12E8T911YXSZJ74ZOM2JOUMBAM8ZOUFF5X9JAP6IY2E983LWK58BLPPEANQ6KRJZG256XQY2LJJMA7WESVGXMF2RP1XZ5FWCOCFCPHWI2FAYI68JN4N5EEKD3P3U6XALSYJXJWXWOSHT4FYDXDMR7W0VWVLX5C8YSU3SFYA2K0Q1YKVEP1RRWV30ZJMT9B5DDBWN8MHUHX7L3IROSZ6493S9IE8VFOMAYZ0HUMRXO6PS6JO0C54QB0BCUTJ4B1XBHIW6F7DNJMVI539YNERSUBAGGVQW3HR9SI53EODVSHKGI46WJICOCZG8CHD4LQBCWHFTREJ3JHVJGZ8884GGRX456VCGUAIXQL8PPHGW7S34U5KT9OY7O5ZXAFJF4IS9M1PIVU5VTLU44DJLSLQIHMNR5HMWDGIRI0WLHQE0KNRV7OL1JD9PGKUIG8JCI7D0VOT0BN2EQC23GSGL02SZE5314S1GHD1KENX1TD2SXWN272XR6BBL2BDH72XKF9R9482GDOO88ZN0DWTV0QI20SUOMAZQO6JEA7T3YR5FIU8TRX34YER1WPIZ20T7KR8JU71FAEXJYF74PFZF2BCRLN1E93HGN6GFYPTKESI2THT3R4YX7KA0M44G6TZ209SKSFAT2G5PUG0ERGKIHT1XIZTRJOGMG7T8NVCWWQSPIH3JAQ51E0ZXDFXSMJ31I7ABEWWDAF0KTNYRNWDE1R5V89HT9OLHDUZ7IXXRSYXUX2K3956MD0AF1NLGSS5U3D2QFPATN4AAW4IDYULEX67ZC5O1ZPQ9HGEGVFO0FQUFDG92WZIFEMIY0Z0Q66SUZ2N553D3VBLZIPGL2AHP11XEDJ5BU4R9ZJX3UWFCV6YCIXU5SRQEUYQZRQG7EDBQB8M7KIW4GIZQ3KMN8ISQUT5SRODI45O9GC5LG1P522U6092NTVMU7PLGEEGHLXPL1UCL3TCXQ8VS4SDW65MEVMWY9UNRWEHFMOVO7ABI
YUUIU90FRIPATGZSGD7XOW5NWW4OYW8WQ91HSLVJUWUNIIDAU08GSZ654PXN5VMLNZ6N" ],
        "output": "long string",
        "payload" : { "artistId" : 2310 }, "weight": 1
    }
}]}]
java.lang.StackOverflowError
    at java.util.HashMap.addEntry(HashMap.java:856)
    at java.util.HashMap.put(HashMap.java:484)
    at java.util.HashSet.add(HashSet.java:217)
    at org.apache.lucene.util.automaton.SpecialOperations.getFiniteStrings(SpecialOperations.java:244)
    at org.apache.lucene.util.automaton.SpecialOperations.getFiniteStrings(SpecialOperations.java:259)
    at org.apache.lucene.util.automaton.SpecialOperations.getFiniteStrings(SpecialOperations.java:259)
    at org.apache.lucene.util.automaton.SpecialOperations.getFiniteStrings(SpecialOperations.java:259)
    at org.apache.lucene.util.automaton.SpecialOperations.getFiniteStrings(SpecialOperations.java:259)...
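The trace bottoms out in SpecialOperations.getFiniteStrings calling itself once per automaton transition, so recursion depth grows with the length of the input string. A minimal illustrative sketch of that failure shape (not Lucene's actual code, just a char-by-char recursive walk with the same stack behavior):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a recursive walk that uses one Java stack frame per
// character of the input, the same shape as the recursion in
// SpecialOperations.getFiniteStrings. A sufficiently long input (or a
// small -Xss) makes this throw StackOverflowError.
class RecursiveWalk {
    static List<String> finiteStrings(String input) {
        List<String> out = new ArrayList<>();
        walk(input, 0, new StringBuilder(), out);
        return out;
    }

    // one stack frame per character of the input
    private static void walk(String s, int i, StringBuilder path, List<String> out) {
        if (i == s.length()) {
            out.add(path.toString());
            return;
        }
        path.append(s.charAt(i));
        walk(s, i + 1, path, out);
        path.setLength(path.length() - 1);
    }
}
```

In Lucene each recursive frame carries more state than this toy, so a 1000+ character suggest input like the one above is enough to blow the default stack.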
@s1monw (Contributor) commented Aug 29, 2013

dude this is a long string... yet, I agree we need to protect users from adding too-long strings. I think we should just cut off at some point. Not sure what a good default is, but maybe 30 chars? Who types 30 chars into a suggest box? :)

@nik9000 (Contributor) commented Aug 29, 2013

You might want to let us configure the limit and advise on stack issues. We don't use the completion suggester yet, but I can see users copying and pasting longish strings into my suggestion environment and expecting it to work. We limit the field to 255 characters, which seems a bit long to me, but I suspect we would need more than 30. 100, maybe?

Also, I bet the data structure becomes a lot less awesome with long strings.

@s1monw (Contributor) commented Aug 29, 2013

> You might want to let us configure the limit and advise on stack issues

Yeah, the length should be configurable.

> We limit the field to 255 characters which seems a bit long to me but I suspect would need more than 30. 100, maybe?

I don't think this is needed. Even if you paste a longish string into the search bar you will still get a suggestion, but only for the first N characters. I am not sure we should provide any if you exceed the limit; your query might be specific enough?

@mattweber (Contributor, Author) commented Aug 29, 2013

Yeah, I agree this is long. This came up while indexing some user-generated data without doing a length check. We are using a 30 character max now and that works great.
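For anyone applying the same client-side guard, here is a sketch of truncating suggest inputs before indexing. The limit of 30 matches the comment above; the surrogate-pair check is an extra precaution of this sketch, not something the issue discusses:

```java
// Sketch of a client-side guard: cap suggest inputs at maxLen UTF-16
// code units before indexing, taking care not to cut a surrogate pair
// in half at the truncation point.
class SuggestInputGuard {
    static String truncate(String input, int maxLen) {
        if (input.length() <= maxLen) {
            return input;
        }
        int cut = maxLen;
        // don't leave a dangling high surrogate at the cut point
        if (Character.isHighSurrogate(input.charAt(cut - 1))) {
            cut--;
        }
        return input.substring(0, cut);
    }
}
```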

@nik9000 (Contributor) commented Aug 29, 2013

Admittedly this is kind of lame, but what if someone pasted Wikipedia:List of Wikipedians by featured article nominations into the search bar and we limited it to 30 characters? We'd end up searching for Wikipedia:List of Wikipedians, which would drive people wild. I'm not sure what the right answer is. Maybe switch to a phrase prefix search if the string is too long.

@mattweber (Contributor, Author) commented Aug 29, 2013

Just to be clear, this issue occurs during indexing and creation of the FST, not at search time when fetching suggestions.

@s1monw (Contributor) commented Aug 29, 2013

@nik9000 I think there is a misunderstanding here. We would always search for the entire string, but the prefix suggester will not build its prefix dictionary for more than N leading characters. I think this should be perfectly fine, and no other suggester impl is affected.
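A toy model of what that means (a sorted map standing in for the suggester's FST, which is an oversimplification): only the first maxLen characters of each input become the lookup key, but the full output is stored, so a query longer than the limit still finds its suggestion once the query is truncated the same way:

```java
import java.util.*;

// Toy model only: a TreeMap standing in for the suggester's FST. Inputs
// are truncated to maxLen UTF-16 code units before becoming lookup keys,
// but the complete output is stored as the value, so a long query
// (truncated the same way) still returns the full suggestion.
class ToySuggester {
    private final TreeMap<String, String> dict = new TreeMap<>();
    private final int maxLen;

    ToySuggester(int maxLen) {
        this.maxLen = maxLen;
    }

    private String cut(String s) {
        return s.length() <= maxLen ? s : s.substring(0, maxLen);
    }

    // index time: only the first maxLen characters form the key
    void index(String input, String output) {
        dict.put(cut(input), output);
    }

    // query time: truncate the query the same way, then prefix-match
    List<String> suggest(String query) {
        String p = cut(query);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : dict.tailMap(p).entrySet()) {
            if (!e.getKey().startsWith(p)) {
                break;
            }
            hits.add(e.getValue());
        }
        return hits;
    }
}
```

The point of the sketch: the limit bounds how much prefix structure is built at index time, not what the user can type or what comes back at query time.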

@nik9000 (Contributor) commented Aug 29, 2013

Oh sorry! So you'd still suggest the right thing even though the prefix dictionary doesn't spell it out exactly. Cool. Sorry for the confusion.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Sep 3, 2013

Add 'min_input_len' to completion suggester
Restrict the input length to a reasonable size, otherwise very long
strings can cause StackOverflowErrors deep down in Lucene land.
Yet, this is simply a safety limit, set to `50` UTF-16 code points by default.
This limit is only present at index time, not at query time. If prefix
completions > 50 UTF-16 code points are expected / desired, this limit should be raised.
Critical string sizes are beyond the 1k UTF-16 code point limit.

Closes elastic#3596

@s1monw s1monw closed this in eb2fed8 Sep 3, 2013

s1monw added a commit that referenced this issue Sep 3, 2013

Add 'min_input_len' to completion suggester

Closes #3596
@mikemccand (Contributor) commented May 4, 2014

See also #5927 and https://issues.apache.org/jira/browse/LUCENE-5628, where we are fixing Lucene's getFiniteStrings so that it does not consume Java stack in proportion to the character length of the index-time or query-time suggestion.
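The shape of that fix, in a simplified sketch over a toy acyclic automaton (not Lucene's actual implementation), is to replace the recursion with an explicit stack of transition iterators, so Java call-stack depth stays constant regardless of string length:

```java
import java.util.*;

// Simplified sketch (not Lucene's code): enumerate all strings accepted
// by a small acyclic automaton using an explicit stack of transition
// iterators instead of recursion, so Java call-stack depth no longer
// grows with the length of the accepted strings.
class State {
    final Map<Character, State> transitions = new LinkedHashMap<>();
    boolean accept;
}

class FiniteStrings {
    static List<String> getFiniteStrings(State start) {
        List<String> out = new ArrayList<>();
        if (start.accept) {
            out.add("");
        }
        Deque<Iterator<Map.Entry<Character, State>>> stack = new ArrayDeque<>();
        StringBuilder path = new StringBuilder();
        stack.push(start.transitions.entrySet().iterator());
        while (!stack.isEmpty()) {
            Iterator<Map.Entry<Character, State>> it = stack.peek();
            if (!it.hasNext()) {
                // backtrack: drop the character that led into this state
                stack.pop();
                if (path.length() > 0) {
                    path.setLength(path.length() - 1);
                }
                continue;
            }
            Map.Entry<Character, State> e = it.next();
            path.append(e.getKey());
            if (e.getValue().accept) {
                out.add(path.toString());
            }
            stack.push(e.getValue().transitions.entrySet().iterator());
        }
        return out;
    }
}
```

With this structure a chain of tens of thousands of states enumerates fine, where the recursive version from earlier in the thread would overflow.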

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015

Add 'min_input_len' to completion suggester

Closes elastic#3596