Usage data sanitization #11

Closed
curusarn opened this issue Jun 26, 2019 · 7 comments
Labels
text draft There is some useful text for when you write your thesis

Comments

@curusarn
Owner

My idea is to replace sensitive info with placeholders.

It's important to make sure that the same piece of information is always replaced with the same placeholder.

Replace sensitive info with hashes.

@curusarn
Owner Author

curusarn commented Jun 26, 2019

paths -> replace dirs except for common ones
git url -> replace url, replace dirs of path

replace parts of arguments if possible:
scp user@server.dev:path/path -> scp HASH@HASH:HASH/HASH

replace whole arguments if they don't match any special form
git commit -m "message" -> git commit -m HASH

keep standard arguments and commands (?)

@curusarn
Owner Author

Almost done: https://github.com/curusarn/resh/tree/dev_2

@curusarn
Owner Author

curusarn commented Aug 11, 2019

I'm handling different types of data differently.

Types

Single value entries

e.g. username, hostname (usually sensitive information)

  1. replace with its hash
    • no exceptions, no whitelist
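
A minimal sketch of this rule, assuming SHA-256 truncated to 12 hex characters as the hash (the thread does not say which hash function or length is used, so both are assumptions, as is the name `hash_value`). Because the hash is deterministic, the same value always maps to the same placeholder:

```python
import hashlib


def hash_value(value: str, length: int = 12) -> str:
    """Return a deterministic placeholder for a sensitive value.

    SHA-256 and the 12-character truncation are assumptions made for
    this sketch, not details taken from the issue.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical record with two single-value entries.
record = {"user": "alice", "host": "laptop.example.org"}

# No exceptions, no whitelist: every value is replaced by its hash.
sanitized = {key: hash_value(val) for key, val in record.items()}
```

Equal inputs give equal placeholders, so records from the same user or host can still be grouped after sanitization.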

Paths

  1. split by /
  2. replace each part with its hash
    • unless it's in the whitelist or it's only one character long
  3. join the parts back together with /
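
The path steps above could be sketched like this. The whitelist contents, the helper names, and the choice of truncated SHA-256 are all assumptions for illustration:

```python
import hashlib

# Hypothetical whitelist; the real one is much larger.
WHITELIST = {"home", "usr", "etc", "bin", "src"}


def hash_value(value: str, length: int = 12) -> str:
    """Deterministic placeholder (truncated SHA-256 is an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


def sanitize_path(path: str) -> str:
    """Hash every path component that is not whitelisted or trivially short."""
    sanitized = []
    for part in path.split("/"):
        # Empty parts (from a leading or doubled "/") and one-character
        # parts such as "~" or "." are kept as they are.
        if len(part) <= 1 or part in WHITELIST:
            sanitized.append(part)
        else:
            sanitized.append(hash_value(part))
    return "/".join(sanitized)
```

With this sketch, `sanitize_path("/home/alice/src")` keeps `home` and `src` and replaces only `alice`.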

Git origin URL

  1. parse the URL using the git-urls library (https://github.com/whilp/git-urls)
  2. replace each part with its hash
    • unless it's in the whitelist or it's only one character long
  3. serialize the sanitized URL back into a string
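
A rough sketch of the same idea. It does not use the git-urls library mentioned above (that is a Go package); `urllib.parse.urlsplit` plus a small regex for the scp-like `user@host:path` form stand in for it. Which parts count as separate pieces (host labels split on `.`, path segments split on `/` and `.`), leaving the scheme unhashed, and dropping any port are assumptions of this sketch, as are the whitelist and all helper names:

```python
import hashlib
import re
from urllib.parse import urlsplit

# Hypothetical whitelist for this example.
WHITELIST = {"git", "com", "github"}

# scp-like form: [user@]host:path
SCP_LIKE = re.compile(r"^(?:(?P<user>[^@/]+)@)?(?P<host>[^:/]+):(?P<path>.+)$")


def hash_value(value: str, length: int = 12) -> str:
    """Deterministic placeholder (truncated SHA-256 is an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


def sanitize_pieces(text: str, separators: str) -> str:
    """Split on separator characters, hash the pieces, keep the separators."""
    pieces = re.split("([" + re.escape(separators) + "])", text)
    return "".join(
        p if len(p) <= 1 or p in WHITELIST else hash_value(p) for p in pieces
    )


def sanitize_git_url(url: str) -> str:
    if "://" in url:
        parts = urlsplit(url)
        user, host, path = parts.username, parts.hostname or "", parts.path
        prefix, host_sep = parts.scheme + "://", ""
    else:
        match = SCP_LIKE.match(url)
        if match is None:
            return hash_value(url)  # unknown form: hash the whole string
        user, host, path = match["user"], match["host"], match["path"]
        prefix, host_sep = "", ":"
    out = prefix
    if user:
        out += sanitize_pieces(user, ".") + "@"
    out += sanitize_pieces(host, ".") + host_sep + sanitize_pieces(path, "/.")
    return out
```

For `git@github.com:curusarn/resh.git`, this sketch leaves `git`, `github`, and `com` readable and hashes only `curusarn` and `resh`.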

Command line

I need to replace the command and arguments separately so that I can analyze partial matches later.
However, I don't want to parse bash.
I'm doing the following:

  1. split the line into consecutive strings of letters and/or digits (tokens)
    • command options are detected and left unhashed
  2. replace each token with its hash
    • unless it's in the whitelist or it's only one character long
  3. join everything back together

Whitelisting

I created a whitelist containing various common strings.

  • directories in /
  • commands installed by default on Ubuntu, Debian or Fedora
  • bash and zsh keywords and builtins
  • file extensions
  • git subcommands
  • some more stuff added by hand:
    • "com", "cz", ...
    • "vim", "emacs", ...
    • "Makefile", "Dockerfile", ...
    • ...
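
As a sketch, the whitelist can be one set built from the categories above. The entries below are small illustrative samples rather than the actual lists (only the hand-picked strings come from this comment), and treating the lookup as exact and case-sensitive is an assumption:

```python
# Illustrative samples only; the real lists are much longer.
ROOT_DIRS = {"bin", "etc", "home", "usr", "var"}
SHELL_KEYWORDS = {"if", "then", "else", "fi", "for", "do", "done"}
FILE_EXTENSIONS = {"txt", "md", "py", "go"}
GIT_SUBCOMMANDS = {"add", "commit", "push", "pull"}
HAND_PICKED = {"com", "cz", "vim", "emacs", "Makefile", "Dockerfile"}

# One flat, immutable set keeps every lookup a single membership test.
WHITELIST = frozenset().union(
    ROOT_DIRS, SHELL_KEYWORDS, FILE_EXTENSIONS, GIT_SUBCOMMANDS, HAND_PICKED
)


def is_whitelisted(token: str) -> bool:
    """Exact, case-sensitive membership test (an assumption of this sketch)."""
    return token in WHITELIST
```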

TL;DR

I pretty much hash everything except:

  • Command-line options
  • All non-alphanumeric characters
  • Single-letter or single-digit strings
  • Anything whitelisted (see above)

@curusarn
Owner Author

curusarn commented Aug 11, 2019

  • TODO: add more file extensions to the whitelist

  • TODO: show this to people

@curusarn
Owner Author

I have shown this to three of my colleagues. Everyone was okay with the result. I also got a suggestion that the data is sanitized too much.

@curusarn
Owner Author

curusarn commented Sep 3, 2019

I have found a couple of file extension databases. However, they don't seem very fond of other people using their data. I have asked FileInfo.com for permission to use their data.

  • WAIT for fileinfo.com to message you back

@curusarn
Owner Author

curusarn commented Sep 3, 2019

I have added a few common TLDs to the whitelist. Source: https://www.hayksaakian.com/most-popular-tlds/

@curusarn curusarn added the text draft There is some useful text for when you write your thesis label Sep 3, 2019
@curusarn curusarn closed this as completed Oct 1, 2019