Improve UA parsing #66

Closed
arp242 opened this issue Nov 22, 2019 · 12 comments · Fixed by #261

Comments

@arp242
Owner

arp242 commented Nov 22, 2019

github.com/mssola/user_agent isn't always reliable. I looked into fixing it, but it's not that easy.

I just noticed there's also https://github.com/avct/uasurfer, which may give better results.

This would also allow storing just the calculated result ("Firefox 70.0") instead of the full UA string, which sometimes contains quite a lot of information.
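To illustrate what "storing just the calculated result" could look like, here is a minimal sketch. It is not how GoatCounter or any of the libraries mentioned in this thread work; the `summarize` helper and the `browserTokens` list are hypothetical, and the token list is deliberately tiny.

```go
package main

import (
	"fmt"
	"strings"
)

// browserTokens lists the product tokens this sketch looks for, most specific
// first. Order matters: Chromium-based browsers also send "Chrome/".
// Hypothetical and far from complete.
var browserTokens = []struct{ token, name string }{
	{"Firefox/", "Firefox"},
	{"Edg/", "Edge"},
	{"OPR/", "Opera"},
	{"Chrome/", "Chrome"},
}

// summarize reduces a User-Agent header to "Name major.minor". It returns ""
// when none of the known tokens is present.
func summarize(ua string) string {
	for _, b := range browserTokens {
		i := strings.Index(ua, b.token)
		if i < 0 {
			continue
		}
		version := ua[i+len(b.token):]
		// The version runs until the next space, if there is one.
		if j := strings.IndexByte(version, ' '); j >= 0 {
			version = version[:j]
		}
		// Keep only the major and minor components.
		parts := strings.SplitN(version, ".", 3)
		if len(parts) > 2 {
			parts = parts[:2]
		}
		return b.name + " " + strings.Join(parts, ".")
	}
	return ""
}

func main() {
	fmt.Println(summarize("Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"))
	// Prints: Firefox 72.0
}
```

A list this short misclassifies any browser it doesn't know about, which is the reliability problem this issue is about.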

@arp242 arp242 added this to the Unplanned milestone Nov 22, 2019
@arp242
Owner Author

arp242 commented Dec 17, 2019

uasurfer also doesn't seem that great; on a few test runs I got a lot of wrong data; see: 4143a04

@arp242
Owner Author

arp242 commented Dec 28, 2019

Another possible project: https://github.com/ua-parser/uap-go

@arp242
Owner Author

arp242 commented Dec 29, 2019

@arp242 arp242 modified the milestones: Unplanned, Version 1.2 Jan 13, 2020
@ptman

ptman commented Jan 14, 2020

I would seriously recommend still storing the full UA header/string. But maybe store it normalized, e.g. with a reference from the requests table to a UA table to save space. The UA table will probably end up fully cached.
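The normalized layout suggested here can be sketched in memory as follows. This is only an illustration of the idea: the `uaTable` and `request` types are hypothetical stand-ins for database tables, not GoatCounter's actual schema.

```go
package main

import "fmt"

// uaTable is a hypothetical in-memory stand-in for a "user agents" table:
// every distinct User-Agent string is stored once and gets a small integer ID.
type uaTable struct {
	ids map[string]int // UA string -> ID
	uas []string       // ID -> UA string
}

func newUATable() *uaTable {
	return &uaTable{ids: make(map[string]int)}
}

// id returns the existing ID for ua, or assigns the next free one.
func (t *uaTable) id(ua string) int {
	if id, ok := t.ids[ua]; ok {
		return id
	}
	id := len(t.uas)
	t.uas = append(t.uas, ua)
	t.ids[ua] = id
	return id
}

// request is a hypothetical row in the requests table; it refers to the
// User-Agent by ID instead of repeating the full string.
type request struct {
	path string
	uaID int
}

func main() {
	t := newUATable()
	firefox := "Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"
	reqs := []request{
		{path: "/", uaID: t.id(firefox)},
		{path: "/about", uaID: t.id(firefox)},
	}
	fmt.Println(len(reqs), "requests,", len(t.uas), "distinct UA string(s)")
}
```

This saves space, but the full string is still stored once per distinct value.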

@arp242
Owner Author

arp242 commented Jan 14, 2020

The problem here is a legal/ethical one @ptman, not a performance/space one. Storing the full User-Agent header makes it easier to identify persons based on the statistical data, and I'd like to make that harder when possible.

Just "Firefox 72" is both useful and quite anonymous, but Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 leaks my OS as well, and especially a lot of mobile browsers send a ridiculous amount of information detailing the OS version, device model, device build version, and language. Here's an example of that:

Mozilla/5.0 (Linux; U; Android 9; fr-fr; Redmi Note 8 Pro Build/PPR1.180610.011) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/71.0.3578.141 Mobile Safari/537.36 XiaoMi/MiuiBrowser/11.1.7-g

It's ridiculous that they're sending this in the first place, but that's not something in my power to fix.

I considered normalizing as well; for example, we can probably get away with removing the data between parentheses (such as "(X11; Linux x86_64; rv:72.0)") from all User-Agent strings. That would already be an improvement, as a lot of the excessive information (though not all of it) is contained in there, but I need to run tests to see how well that works out.

So in short, I'm not 100% sure yet what the best solution is here.
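The parenthesis-stripping idea from this comment can be sketched with a regular expression. This is an untested-in-production illustration, not the approach that was eventually merged; `stripParens` is a hypothetical helper and it does not handle nested parentheses.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// parenGroup matches one parenthesized group plus any whitespace before it.
// Nested parentheses are not handled.
var parenGroup = regexp.MustCompile(`\s*\([^)]*\)`)

// stripParens removes every parenthesized group from a User-Agent string.
func stripParens(ua string) string {
	return strings.TrimSpace(parenGroup.ReplaceAllString(ua, ""))
}

func main() {
	fmt.Println(stripParens("Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"))
	// Prints: Mozilla/5.0 Gecko/20100101 Firefox/72.0
}
```

Applied to the Redmi example above, this drops the Android version, locale, device model, and build, but the trailing `XiaoMi/MiuiBrowser/11.1.7-g` token remains, which matches the "though not all" caveat.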

@ptman

ptman commented Jan 14, 2020

UA strings are useful for debugging and also for grouping different clients that ignore cookies. It's data sent willingly from the browser, not something you have to go digging around to extract. Operating systems can make a huge difference in browser behaviour. And it's something that by default ends up in httpd logs. I understand the desire for privacy, but I would just store the whole UA string. Especially since they have been tricky to parse in the past and can be tricky to parse in the future.

@arp242
Owner Author

arp242 commented Jan 14, 2020

Yeah, I appreciate there are advantages to storing it as well, which is why that is what GoatCounter is doing now. It's a bit of a tricky balancing act. Aside from doing "the right thing" here, there is also the legal aspect to consider; the GDPR specifically mentions:

> Natural persons may be associated with online identifiers provided by their devices, applications, tools and protocols, such as internet protocol addresses, cookie identifiers or other identifiers such as radio frequency identification tags. This may leave traces which, in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them.

Does this cover these kinds of User-Agent strings? Possibly.

> data sent willingly from the browser

I don't think most users know that the full device info and language are being sent.

@ptman

ptman commented Jan 14, 2020

I'm not a GDPR lawyer, but UA strings are OK in logs, AFAIK. GDPR allows processing information for different purposes, one being consent. But logs aren't processed based on consent. It's probably "for legitimate interests of data controller", i.e. technical maintenance, troubleshooting, debugging, etc. One could argue that UA strings are an old technical debugging device that helps with maintenance, e.g. identifying scrapers.

@arp242
Owner Author

arp242 commented Jan 14, 2020

Yeah, maybe. I think with the lack of case law and inconsistent interpretations right now no one can really tell how it applies here for sure.

@arp242
Owner Author

arp242 commented Jan 14, 2020

https://groups.google.com/a/chromium.org/forum/m/#!msg/blink-dev/-2JIRNMWJ7s/yHe4tQNLCgAJ

I was aware of client hints, but had no idea things were going to move this fast...

@DanielRuf

UA strings were never reliable and will not be very relevant in the near future.

@arp242
Owner Author

arp242 commented Jan 20, 2020

Just because it's not 100% reliable doesn't mean it's not useful. It's mostly accurate and gives a good indication of which browsers people are using, which is useful in making decisions about browser support and the like.

I don't know what the future will hold. I know about Google's recently announced plans (linked above) but older browsers won't implement that, and it's especially useful to see if people are using older browsers. I suspect it will still be useful for several years to come.

@arp242 arp242 modified the milestones: Version 1.2, Planned Mar 18, 2020
@arp242 arp242 modified the milestones: Planned, Version 1.3 Apr 7, 2020