Strange User Agent strings don't get logged #386

Closed
jesusbagpuss opened this issue Mar 18, 2016 · 6 comments

Comments


@jesusbagpuss jesusbagpuss commented Mar 18, 2016

Just spotted this in the Apache error log:
DBD::mysql::st execute failed: Incorrect string value: '\xC6\xC6\xBD\xE2\xBA\xF3...' for column 'requester_user_agent' at row 1 at /usr/share/eprints/perl_lib/EPrints/Database.pm line 1184.

User-Agent:
"\xc6\xc6\xbd\xe2\xba\xf3\xb5\xc4"

Not quite sure what this is meant to be - but it doesn't go into the database cleanly!
Might need to do some form of sanity check on the User-Agent.


@denics denics commented Mar 18, 2016

It is the user agent of one of the most used browsers in the world :)
hex -> Latin-1 gives: ÆÆ½âºóµÄ
http://stats.crc.uiuc.edu/2013/sep13/CDL_No_Spider/Browsers.htm
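Reading each octet as a Latin-1 code point is one way to turn those bytes into displayable characters. A minimal Python sketch (the variable names are illustrative, not from EPrints) shows why that decode always "works" while a strict ASCII decode does not:

```python
# The eight octets from the logged User-Agent header.
raw = b"\xc6\xc6\xbd\xe2\xba\xf3\xb5\xc4"

# Latin-1 maps every byte 0x00-0xFF to the code point with the same value,
# so this decode can never fail, whatever the bytes were meant to be.
as_latin1 = raw.decode("latin-1")
print(as_latin1)

# ASCII only covers 0x00-0x7F, so a strict ASCII decode rejects these bytes.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print("valid ASCII:", ascii_ok)
```

Because Latin-1 accepts any byte sequence, a readable-looking result says nothing about which encoding the client actually used.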


@jesusbagpuss jesusbagpuss commented Mar 24, 2016


@jesusbagpuss jesusbagpuss commented Apr 6, 2016

Not sure whether this would help:
https://mathiasbynens.be/notes/mysql-utf8mb4
Apparently, `utf8` in MySQL means UTF-8 sequences of at most 3 bytes per character!?
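To make the 3-byte limit concrete, here is a rough Python sketch. The helper names `fits_in_three_byte_utf8` and `is_valid_utf8` are hypothetical, and the check only approximates what MySQL's 3-byte `utf8` charset accepts. It also suggests that the bytes in this report fail for a different reason: they are not valid UTF-8 at all.

```python
def fits_in_three_byte_utf8(text: str) -> bool:
    """True if every character encodes to at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


def is_valid_utf8(raw: bytes) -> bool:
    """True if the octets form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(fits_in_three_byte_utf8("\u7834\u89e3"))   # CJK characters need 3 bytes each
print(fits_in_three_byte_utf8("\U0001F616"))     # an emoji needs 4 bytes
print(is_valid_utf8(b"\xc6\xc6\xbd\xe2\xba\xf3\xb5\xc4"))
```

So switching to utf8mb4 helps with 4-byte characters, but on its own it would not make the octets from the error log storable as text.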


@phluid61 phluid61 commented Feb 23, 2017

Ah, good old GB2312. Transcoding those bytes from GB2312 to UTF-8 we get "\xe7\xa0\xb4\xe8\xa7\xa3\xe5\x90\x8e\xe7\x9a\x84", which is perfectly reasonable 3-byte UTF-8 (破解后的 - "after the crack"??).
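The transcoding step can be reproduced with Python's built-in `gb2312` codec. This is only an illustration of the conversion described above, not code from EPrints:

```python
# Octets as they arrived in the User-Agent header.
raw = b"\xc6\xc6\xbd\xe2\xba\xf3\xb5\xc4"

# Interpret the octets as GB2312 to recover the intended characters...
text = raw.decode("gb2312")

# ...then re-encode those characters as UTF-8.
utf8 = text.encode("utf-8")

print(text)
print(utf8)
# Each of the four characters takes exactly 3 bytes in UTF-8.
print([len(ch.encode("utf-8")) for ch in text])
```

Note that this only works because the encoding was guessed correctly; the header itself carries no charset information.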

The real problem is that the UA is sending an invalid 'User-Agent' header. From the RFCs:

    User-Agent = product *( RWS ( product / comment ) )
    product    = token ["/" product-version]

    token      = 1*tchar

    tchar      = "!" / "#" / "$" / "%" / "&" / "'" / "*"
               / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
               / DIGIT / ALPHA
               ; any VCHAR, except delimiters

\xC6 is not in tchar, nor are any of its friends.
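As a rough illustration, the `tchar` rule can be turned into a byte-level check in Python. The names `TCHAR` and `is_token` are made up for this sketch, and it only tests the `token` rule, not the full `User-Agent` grammar (product versions, comments, whitespace):

```python
import string

# Every octet allowed by the tchar rule: ALPHA, DIGIT and the listed symbols.
TCHAR = frozenset(
    (string.ascii_letters + string.digits + "!#$%&'*+-.^_`|~").encode("ascii")
)


def is_token(value: bytes) -> bool:
    """True if value matches token = 1*tchar (at least one octet, all tchar)."""
    return len(value) > 0 and all(octet in TCHAR for octet in value)


print(is_token(b"curl"))
print(is_token(b"\xc6\xc6\xbd\xe2\xba\xf3\xb5\xc4"))
print(0xC6 in TCHAR)
```

A check like this could be used to decide whether a header value needs sanitising before it is logged.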

From what I've read and what I understand, there shouldn't be a problem sanitising rubbish by C-style backslash-escaping the raw octets. I don't mind the database literally holding the string "\\xc6\\xc6\\xbd\\xe2\\xba\\xf3\\xb5\\xc4". That would also work for non-ASCII UTF-8 such as \xf0\x9f\x98\x96
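A minimal sketch of that kind of backslash-escaping, written in Python for illustration. The function name `escape_octets` is hypothetical, and this is not necessarily what the eventual EPrints fix does. It keeps printable ASCII as-is, doubles literal backslashes so the result stays unambiguous, and writes every other octet as `\xNN`:

```python
def escape_octets(raw: bytes) -> str:
    """Return a pure-ASCII, C-style escaped rendering of arbitrary octets."""
    out = []
    for octet in raw:
        if octet == 0x5C:               # a literal backslash is doubled
            out.append("\\\\")
        elif 0x20 <= octet <= 0x7E:     # printable ASCII passes through
            out.append(chr(octet))
        else:                           # everything else becomes \xNN
            out.append("\\x%02x" % octet)
    return "".join(out)


print(escape_octets(b"\xc6\xc6\xbd\xe2\xba\xf3\xb5\xc4"))
print(escape_octets(b"\xf0\x9f\x98\x96"))
print(escape_octets(b"Mozilla/5.0"))
```

The output is plain ASCII, so it should be storable regardless of the column's character set, at the cost of using four characters per escaped octet.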


@jiadiyao jiadiyao commented Nov 7, 2017

It appears that to get octets, we need the encode function:
$octets = encode('UTF-8', $characters, Encode::FB_CROAK); (https://perldoc.perl.org/Encode.html)
However, having tried various combinations of encode, decode, utf8 and gb2312, I cannot get it to show octets on my system (corrupted characters are saved in my latin1-encoded MySQL).
I used the following to simulate a ua string:
$ curl -A "破解后的" http://testgithub.local/id/eprint/77/
The problem could also be that my terminal is encoding the string differently than the browser in the field.


@phluid61 phluid61 commented Nov 7, 2017

You can either switch your terminal's encoding, or you can use some shell voodoo to make curl send the exact bytes in the User-Agent header field; for example:

echo -e "-A \xc6\xc6\xbd\xe2\xba\xf3\xb5\xc4" | xargs curl http://testgithub.local/id/eprint/77/
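If shell quoting gets in the way, the same raw octets can be sent without involving the terminal's encoding at all. The following Python sketch builds the request by hand; the helper names `build_request` and `send_request` are hypothetical, and plain HTTP on port 80 is an assumption:

```python
import socket


def build_request(host: str, path: str, user_agent: bytes) -> bytes:
    """Assemble a minimal HTTP/1.1 GET request with an arbitrary-octet User-Agent."""
    lines = [
        b"GET " + path.encode("ascii") + b" HTTP/1.1",
        b"Host: " + host.encode("ascii"),
        b"User-Agent: " + user_agent,   # raw octets, no re-encoding
        b"Connection: close",
        b"",
        b"",
    ]
    return b"\r\n".join(lines)


def send_request(host: str, path: str, user_agent: bytes, port: int = 80) -> bytes:
    """Send the request over a plain TCP socket and return the raw response."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path, user_agent))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


request = build_request(
    "testgithub.local", "/id/eprint/77/", b"\xc6\xc6\xbd\xe2\xba\xf3\xb5\xc4"
)
print(request)
```

Calling `send_request` with the same arguments would actually contact the test host; it is left uncalled here so the sketch runs without a network.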

On our repository I've hacked some heuristic-based detection and conversion into some parts of our code, but that was a quick-fix workaround (for existing bad data) and a657554 presents a much more correct type of fix.

(Note: we've also changed our database to use utf8mb4 encodings and collations.)

drn05r added a commit that referenced this issue Apr 18, 2019
Fixes #386 by removing SQL injection vector using user agent string.