Java: StringLiteral and CharacterLiteral getValue() replaces unpaired surrogates with ? #6611

@Marcono1234

Version

CodeQL CLI version: 2.6.0

Description of the issue

Slightly related to #5297

The predicate getValue() of CodeQL's StringLiteral and CharacterLiteral seems to replace unpaired Unicode surrogates (high surrogates U+D800 to U+DBFF and low surrogates U+DC00 to U+DFFF) with the character ?.
This is not a display problem in the Query Console or the VS Code extension; the database itself appears to store ? as the value.

This can lead to incorrect results for queries since the value reported by CodeQL does not match what the source code contains.
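For illustration, here is a minimal Java sketch of the kind of source code affected. The class and method names (UnpairedSurrogateDemo, roundTripUtf8) are hypothetical. It shows that the compiled literals keep their surrogate code units, and that a round trip through UTF-8 turns a lone surrogate into ?. Whether a conversion like this is what happens inside CodeQL is only an assumption; the issue does not say where the replacement occurs.

```java
import java.nio.charset.StandardCharsets;

class UnpairedSurrogateDemo {
    // A lone high surrogate in a string literal (legal Java source).
    static final String LONE_HIGH = "\uD800";
    // A lone low surrogate in a character literal.
    static final char LONE_LOW = '\uDC00';

    // Round-trips a string through UTF-8 bytes, as a lossy pipeline might.
    // Assumption: this only illustrates one way a '?' can appear.
    static String roundTripUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The runtime values still hold the surrogate code units.
        System.out.println(Integer.toHexString(LONE_HIGH.charAt(0))); // d800
        System.out.println(Character.isSurrogate(LONE_LOW));          // true
        // UTF-8 cannot encode a lone surrogate, so getBytes substitutes '?'.
        System.out.println(roundTripUtf8(LONE_HIGH));                 // ?
    }
}
```

A query that compares getValue() against the real runtime value of such a literal would therefore see a mismatch.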

Reproduction steps

Run the following query:

import java

from StringLiteral s, string literal, string value
where
  literal = s.getLiteral()
  and value = s.getValue()
  // Value contains '?'
  and value.matches("%?%")
  // But literal does not contain '?'; neither literally nor escaped
  and not literal.matches(["%?%", "%\\77%", "%\\077%", "%\\u003f%", "%\\u003F%"])
select s, value

Query Console link

Workaround

A workaround might be to use getLiteral(), which contains the Unicode escape sequences for the surrogate characters. However, you then have to parse the escape sequences manually, which is rather error-prone.
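To give a sense of why manual parsing is error-prone, here is a rough Java sketch of decoding only the Unicode escapes in a literal's source text. UnicodeEscapeDecoder and decodeUnicodeEscapes are hypothetical names, not part of CodeQL, and the workaround itself would have to be written in QL. The sketch follows the Java Language Specification rules: a backslash starts a Unicode escape only when preceded by an even number of contiguous backslashes, and any number of u characters may follow it. Other escapes such as \n or octal escapes are not handled.

```java
final class UnicodeEscapeDecoder {
    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    private static boolean fourHexDigitsAt(String s, int start) {
        if (start + 4 > s.length()) return false;
        for (int k = start; k < start + 4; k++) {
            if (!isHexDigit(s.charAt(k))) return false;
        }
        return true;
    }

    // Decodes Unicode escapes only; malformed escapes are copied unchanged
    // (javac would reject them, so this is a simplification).
    static String decodeUnicodeEscapes(String source) {
        StringBuilder out = new StringBuilder(source.length());
        int backslashRun = 0; // contiguous backslashes directly before index i
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            boolean eligible = c == '\\' && backslashRun % 2 == 0;
            if (eligible && i + 1 < source.length() && source.charAt(i + 1) == 'u') {
                int j = i + 1;
                // Any number of 'u' characters may follow the backslash.
                while (j < source.length() && source.charAt(j) == 'u') j++;
                if (fourHexDigitsAt(source, j)) {
                    out.append((char) Integer.parseInt(source.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashRun = 0; // a decoded character never extends a backslash run
                    continue;
                }
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Each argument below is the raw source text of an escape sequence.
        System.out.println((int) decodeUnicodeEscapes("\\uD800").charAt(0)); // 55296
        System.out.println(decodeUnicodeEscapes("\\uuu0041"));               // A
        System.out.println(decodeUnicodeEscapes("\\\\uD800").length());      // 8
    }
}
```

The last call shows the parity rule: two backslashes followed by uD800 is an escaped backslash plus plain text, so nothing is decoded.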
