Skip to content

Detect Catastrophic backtracking when using the pattern_replace char_filter #17934

@tomfotherby

Description

@tomfotherby

I'm using the Pattern Replace Char Filter to filter out urls from my query text before it is tokenised. But I found that some input strings cause ElasticSearch to crash - It would be good if ElasticSearch could timeout or detect problems with the regexp instead of "hanging" forever (slowly eating more CPU and memory).


Example

// Character filters are used to "tidy up" a string *before* it is tokenized.
'char_filter' => [
    'url_removal_pattern' => [
        'type'        => 'pattern_replace',
        'pattern'     => '(?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))',
        'replacement' => '',
    ],

This filter was working fine for some weeks until suddenly ElasticSearch started crashing. We found someone was trying to do a javascript injection attack in our search box.

I pasted the regex and the attack string into https://regex101.com

  • Regexp:
    • (?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s!()\[\]{};:\'".,<>?«»“”‘’]))
  • Test string:
    • twitter.com/widgets.js\";fjs.parentNode.insertBefore(js,fjs);}}(document,\"script\",\"twitter-wjs\"

https://regex101.com shows the problem to be "Catastrophic backtracking"

Catastrophic backtracking has been detected and the execution of your expression has been halted. To find out more what this is, please read the following article: Runaway Regular Expressions.

It would be great if ElasticSearch could detect "Catastrophic backtracking" and throw a error. It would be great if there was a parameter on pattern_replace to provide a timeout, any queries that go over this timeout would just not match, rather than cause a hang.


As an aside, I created a unit test for our PHP application that uses the same regexp and test string. (PHP can understand the same regexp, even though it's obviously for Java in the ElasticSearch case) . Interestingly in php, the regex results in null which is the documented response of preg_replace when a error occurs. If PHP can return a error rather than crashing - surely ElasticSearch can too :trollface: ?

namespace app\tests\unit;
use \yii\codeception\TestCase;

class TagsControllerTest extends TestCase
{
    public function testRegexForURLDetection()
    {
        $regex = '(?mi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
        // Test the Catastrophic backtracking problem
        $testString = "twitter.com/widgets.js\";fjs.parentNode.insertBefore(js,fjs);}}(document,\"script\",\"twitter-wjs\"";
        // This shows the regex is not working for our test string - it gives null but should give 'hello '
        $this->assertEquals(null, preg_replace("/$regex/", '', "hello $testString"));
    }
}

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions