Ill-formed byte sequences should be validated #30

masakielastic · 2014-10-23T07:13:14Z

The validator accepts ill-formed byte sequences. The definition of a well-formed UTF-8 string can be found in RFC 3629 or in "Table 3-7. Well-Formed UTF-8 Byte Sequences" of the Unicode Standard (from my answer on Stack Overflow).

use Egulias\EmailValidator\EmailValidator;

$validator = new EmailValidator;
$email = "\x80\x81\x82@\x83\x84\x85.\x86\x87\x88";

var_dump(
    true === $validator->isValid($email)
);

A UTF-8 string can be validated with htmlspecialchars or preg_match.

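function utf8_validate($str) {
    // htmlspecialchars() returns an empty string for ill-formed UTF-8 input
    // (unless ENT_IGNORE or ENT_SUBSTITUTE is set), so the round trip fails.
    return $str === htmlspecialchars_decode(htmlspecialchars($str, ENT_QUOTES, 'UTF-8'));
}

function utf8_validate2($str) {
    // With the /u modifier, preg_match() returns false for an ill-formed subject.
    return false !== preg_match('/./u', $str);
}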
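
As a further option, the mbstring extension offers a direct check. The following is only a minimal sketch, assuming mbstring is available; the name utf8_validate3 is hypothetical and simply continues the naming above.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function utf8_validate3($str) {
    // mb_check_encoding() reports whether the bytes are valid in the given encoding.
    return mb_check_encoding($str, 'UTF-8');
}

var_dump(utf8_validate3("test@example.com"));          // bool(true)
var_dump(utf8_validate3("\x80\x81\x82@\x83\x84\x85")); // bool(false): lone continuation bytes
```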

masakielastic · 2014-10-23T08:23:15Z

Part of the C0 controls (U+0000 - U+000F) and the C1 controls (U+0080 – U+009F) are also accepted. I could not find the range U+0000 - U+000F in the definition of the local part (RFC5322BNF.html).

use Egulias\EmailValidator\EmailValidator;

$validator = new EmailValidator;

for ($i = 0; $i < 0x100; ++$i) {

    $c = utf8_chr($i);
    $email = $c .'test@example.com';

    if ($validator->isValid($email)) {

        if (preg_match('/\p{Cc}/u', $c)) {
            // Print the code point as U+XXXX (uppercase hex, zero-padded to 4 digits).
            printf('U+%04X ', $i);
        }
    }
}

echo PHP_EOL;

function utf8_chr($code_point) {

    // Reject values outside the Unicode range and the surrogate code points.
    if ($code_point < 0 || 0x10FFFF < $code_point || (0xD800 <= $code_point && $code_point <= 0xDFFF)) {
        return '';
    }

    if ($code_point < 0x80) {
        // 1 byte: 0xxxxxxx
        $hex[0] = $code_point;
        $ret = chr($hex[0]);
    } else if ($code_point < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx (the lead byte prefix is 0xC0, not 0x1C0)
        $hex[0] = 0xC0 | ($code_point >> 6);
        $hex[1] = 0x80 | ($code_point & 0x3F);
        $ret = chr($hex[0]).chr($hex[1]);
    } else if ($code_point < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        $hex[0] = 0xE0 | ($code_point >> 12);
        $hex[1] = 0x80 | (($code_point >> 6) & 0x3F);
        $hex[2] = 0x80 | ($code_point & 0x3F);
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]);
    } else  {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        $hex[0] = 0xF0 | ($code_point >> 18);
        $hex[1] = 0x80 | (($code_point >> 12) & 0x3F);
        $hex[2] = 0x80 | (($code_point >> 6) & 0x3F);
        $hex[3] = 0x80 | ($code_point & 0x3F);
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]).chr($hex[3]);
    }

    return $ret;
}
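
Both problems (ill-formed bytes and control characters) could be caught by one pre-check before the address reaches the validator. This is only a sketch under that assumption; has_no_control_chars is a hypothetical name and is not part of EmailValidator.

```php
<?php
// Hypothetical pre-check, not part of the EmailValidator library.
function has_no_control_chars($str) {
    // With /u, preg_match() returns false for ill-formed UTF-8,
    // 1 if a control character (General Category Cc) is found, and 0 otherwise.
    $result = preg_match('/\p{Cc}/u', $str);

    // Only 0 means "well-formed and free of C0/C1 control characters".
    return $result === 0;
}

var_dump(has_no_control_chars("test@example.com"));         // bool(true)
var_dump(has_no_control_chars("\x00test@example.com"));     // bool(false): U+0000
var_dump(has_no_control_chars("\xC2\x80test@example.com")); // bool(false): U+0080
var_dump(has_no_control_chars("\x80test@example.com"));     // bool(false): ill-formed
```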

egulias · 2014-10-26T22:13:11Z

Hi @masakielastic !
Great report, thanks!
Sorry for my late response but I've been pretty busy these days.
I'll work on this as soon as I can. If you can provide a PR I'll gladly merge it!

#30 - Improved utf8

egulias · 2014-11-29T09:13:44Z

Hi!
Can you check release 1.2.6 and close the issue if it resolves the problem? Thanks!

egulias · 2015-01-04T22:50:36Z

@masakielastic I'll close this, please check version 1.2.7. If you find more issues, please create new ones.
Thanks!

egulias added the bug label Nov 2, 2014

egulias added a commit that referenced this issue Nov 17, 2014

#30 - [WIP] - improve utf-8 control

833eb65

egulias mentioned this issue Nov 17, 2014

Invalid email appears valid #36

Closed

egulias added a commit that referenced this issue Nov 29, 2014

#30 - Improved control for UTF8 chars

107b970

egulias added a commit that referenced this issue Nov 29, 2014

#30 - Improved control for UTF8 chars

9103f4f

egulias added a commit that referenced this issue Nov 29, 2014

Merge pull request #38 from egulias/improved-utf8

7a64ea1

egulias closed this as completed Jan 4, 2015

arcaela mentioned this issue Aug 21, 2019

E_NOTICE Array offset Laravel 5.8 #212

Closed
