crocs-muni/classifyRSAkey

Measuring Popularity of Cryptographic Libraries in Internet-Wide Scans

Matus Nemec, Dusan Klinec, Petr Svenda, Peter Sekan, and Vashek Matyas

We measure the popularity of cryptographic libraries in large datasets of RSA public keys. We do so by improving a recently proposed method based on biases introduced by alternative implementations of prime selection in different cryptographic libraries. We extend the previous work by applying statistical inference to approximate the share of libraries matching an observed distribution of RSA keys in an inspected dataset (e.g., an Internet-wide scan of TLS handshakes). The sensitivity of our method is sufficient to detect transient events such as a periodic insertion of keys from a specific library into Certificate Transparency logs and inconsistencies in archived datasets.

We apply the method to keys from multiple Internet-wide scans collected in the years 2010 through 2017, to Certificate Transparency logs, and to separate datasets of PGP keys and SSH keys. The results quantify a strong dominance of OpenSSL, with more than 84% of TLS keys for Alexa 1M domains, steadily increasing since the first measurement. OpenSSL is even more popular for GitHub client-side SSH keys, with a share larger than 96%. Surprisingly, new certificates inserted into Certificate Transparency logs on certain days contain more than 20% keys most likely originating from Java libraries, while TLS scans contain less than 5% of such keys.

Since the ground truth is not known, we compared our measurements with other estimates and simulated different scenarios to evaluate the accuracy of our method. To the best of our knowledge, this is the first accurate measurement of the popularity of cryptographic libraries based not on proxy information like web server fingerprinting, but directly on the observed unique keys.

Classification tool for the ACSAC 2017 paper

Java tool used to estimate the share of cryptographic libraries from datasets of RSA keys. The measurement is based on subtle biases in different key generation methods used by cryptographic libraries, as introduced at the USENIX Security 2016 conference.

Conference website: https://www.acsac.org/2017

Paper website: https://www.acsac.org/2017/openconf/modules/request.php?module=oc_program&action=summary.php&id=106

Project details: https://crocs.fi.muni.cz/public/papers/acsac2017

Data processing (TLS, PGP): https://github.com/crocs-muni/acsac2017-data-tools

Data processing (Certificate Transparency): https://github.com/crocs-muni/acsac2017-certificate-transparency-java

This project extends the work presented in the paper "The Million-Key Question – Investigating the Origins of RSA Public Keys".

The tool is a fork of the classifyRSAkey tool. Original project details: https://crocs.fi.muni.cz/public/papers/usenix2016

Note: this version substantially changes the purpose (from classification of a single key to measurement over a large dataset), functionality, implementation, and CLI of the tool. If you want to use the original tool, look for the relevant release of the classifyRSAkey tool.

Build

git clone https://gitlab.fi.muni.cz/xnemec1/classifyRSAkey
cd classifyRSAkey/
ant

Run

cd out/artifacts/classifyRSAkey_jar/
java -jar classifyRSAkey.jar

Basic usage

Classification of a dataset of RSA keys - produces an estimate of library popularity

Overview of the options for classification (-c prefix):

  • -t table = path to classification table file (.json)
  • -i in... = path(s) to data set(s) - large dataset of RSA keys
  • -o outdir = path to folder for storing results
  • -b batch = source|primes|modulus_hash|none = how to batch keys - keys in a batch are assumed to be generated by the same library
    • source - create a batch if keys share a source
    • primes - create a batch if keys share a prime
    • modulus_hash - create a batch if keys share a modulus
    • none - do not make any batches
  • -p prior = estimate|uniform|table = prior probability
    • estimate - estimate prior probability from the dataset
    • uniform - assume libraries are equally popular
    • table - use prior probability specified in the classification table
  • -e export = none|json|csv = annotated dataset export format (each key from input is assigned vector of probabilities)
    • none - do not export
    • json - export in json (-b must not be none, -m must not be none)
    • csv - export in csv (currently not supported)
  • -m temp = none|disk|memory = temporary memory handling - only for export
    • none - no export done
    • disk - do not keep dataset in memory, read from disk again
    • memory - keep whole dataset in memory (for smaller datasets)
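As a concrete illustration of batching, the Python sketch below groups key records by a hash of their modulus, mirroring the idea behind -b modulus_hash (keys sharing a modulus are assumed to come from the same library). The function and record names are illustrative assumptions, not the tool's actual internals, and the moduli are toy values.

```python
import hashlib
import json
from collections import defaultdict

def batch_by_modulus_hash(lines):
    """Group JSON key records (one object per line) by a hash of "n"."""
    batches = defaultdict(list)
    for line in lines:
        key = json.loads(line)
        digest = hashlib.sha256(key["n"].encode()).hexdigest()
        batches[digest].append(key)
    return batches

# Toy records: the first two share a modulus, so they form one batch.
keys = [
    '{"n":"0xc0ffee","e":"0x10001","source":["host-a"]}',
    '{"n":"0xc0ffee","e":"0x10001","source":["host-b"]}',
    '{"n":"0xdecade","e":"0x10001","source":["host-c"]}',
]
batches = batch_by_modulus_hash(keys)
```

With these toy records, two distinct moduli yield two batches, one of them holding both records that share a modulus.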
Example 1 - estimate the proportion of libraries
java -jar classifyRSAkey.jar -c -t in_classification_table.json -i in_rsa_unique_keys.json -o out_dir -p estimate -b none -e none

Explanation of options:

  • -t in_classification_table.json = classification table, output of -m
  • -i in_rsa_unique_keys.json = unique RSA public keys, one JSON object per line
    • {"n":"0x<RSA modulus in hexadecimal>", "e":"0x<public exponent in hexadecimal>", "source":[<list of strings, e.g., validity, common name>]}
    • if the original dataset contains duplicate RSA moduli, first obtain unique keys using -um, -uf or -rd
    • see also example RSA public keys - public PGP dataset from April 2017
  • -o out_dir = directory for output - following outputs will be created in out_dir/in_rsa_unique_keys.json/:
    • prior_probability.json - key/value json:
      • "probability" - group name to group probability
      • "groups" - group name to list of libraries
      • "frequencies" - feature mask to number of keys
    • individual_statistics.csv, overall_statistics.csv - legacy output of classification
  • -p estimate = estimate prior probability from data (rather than using one from table or uniform prior probability)
  • -b none = do not create batches of keys (rather than batching them by source or modulus_hash)
  • -e none = do not export keys (do not perform key classification at the end)
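For reference, here is a minimal Python sketch of producing one input line in the expected format (one JSON object per line, hexadecimal "n" and "e", a list of source strings). The helper name is ours and the modulus is a toy value, not a real RSA key.

```python
import json

def key_record(n: int, e: int, sources: list) -> str:
    """Serialize one RSA public key as a single JSON line for the tool."""
    return json.dumps({"n": hex(n), "e": hex(e), "source": sources})

# Toy modulus with the common public exponent 65537 (0x10001).
line = key_record(0xB08D7A, 0x10001, ["example.com", "2017-04"])
```

Writing one such line per key produces a file suitable for the -i option (after deduplication with -um, -uf, or -rd if needed).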
Example 2 - estimate the proportion of libraries, use the estimate as prior probability to classify the keys
java -jar classifyRSAkey.jar -c -t in_classification_table.json -i in_rsa_unique_keys.json -o out_dir -p estimate -b modulus_hash -e json -m memory

Explanation of different options:

  • -b modulus_hash = do not make any assumptions about the origin of keys (e.g., you could instead assume keys with the same source to be generated by the same library)
  • -e json = after library popularity estimate, use it as prior probability and classify every key; output will be written into json
  • -m memory = read the dataset only once and keep it in memory; for large datasets use "disk" instead
  • out_dir will also contain dataset_in_rsa_unique_keys.json with the resulting classification of keys

Show information about groups in a classification table

java -jar classifyRSAkey.jar -i in_classification_table.json

Example output for the precomputed classification table for ACSAC '17:

Group name: Group sources
Group  1: G&D SmartCafe 3.2
Group  2: G&D SmartCafe 4.x & 6.0
Group  3: GNU Crypto 2.0.1
Group  4: Gemalto GXP E64
Group  5: NXP J2A080 & J2A081 & J3A081 & JCOP 41 V2.2.1
Group  6: Oberthur Cosmo Dual 72K
Group  7: OpenSSL 0.9.7 & 1.0.2g & 1.0.2k & 1.1.0e
Group  8: PGPSDK 4 FIPS
Group  9: Infineon JTOP 80K, YubiKey 4 & 4 Nano
Group 10: NXP J2D081 & J2E145G, YubiKey NEO
Group 11: Bouncy Castle 1.54 (Java), Crypto++ 5.6.0 & 5.6.3 & 5.6.5, Libgcrypt 1.7.6 FIPS, Microsoft CryptoAPI & CNG & .NET
Group 12: Bouncy Castle 1.53 (Java), Cryptix JCE 20050328, FlexiProvider 1.7p7, HSM Utimaco Security Server Se50, Nettle 2.0, PolarSSL 0.10.0, PuTTY 0.67, SunRsaSign OpenJDK 1.8.0, mbedTLS 1.3.19 & 2.2.1 & 2.4.2
Group 13: Botan 1.5.6 & 1.11.29 & 2.1.0, Feitian JavaCOS A22 & A40, Gemalto GCX4 72K, HSM SafeNet Luna SA-1700, LibTomCrypt 1.17, Libgcrypt 1.6.0 & 1.6.5 & 1.7.6, Libgcrypt 1.6.0 FIPS & 1.6.5 FIPS, Nettle 3.2 & 3.3, Oberthur Cosmo 64, OpenSSL FIPS 2.0.12 & 2.0.14, PGPSDK 4, WolfSSL 2.0rc1 & 3.9.0 & 3.10.2, cryptlib 3.4.3 & 3.4.3.1

Create a custom classification table from raw keys

java -jar classifyRSAkey.jar -m in_makefile.json out_classification_table.json

Raw keys:

Remove duplicate moduli in a dataset of public keys

java -jar classifyRSAkey.jar -rd in_rsa_keys.json out_rsa_unique_keys.json
  • in_rsa_keys.json - RSA public keys, one JSON object per line
    • {"n":"0x<RSA modulus in hexadecimal>", "e":"0x<public exponent in hexadecimal>", "source":[<list of strings, e.g., validity, common name>]}
  • out_rsa_unique_keys.json - unique RSA keys
    • adds "occurrence":<number of duplicities>, collects source information
  • instead of -rd, you can use -um if source information does not need to be stored - only the first occurrence of a key (based on "modulus" or "n") will be output
  • instead of -rd, you can use -uf if keys contain "fprint" - e.g., a fingerprint of the certificate - only the first occurrence of a key with that fingerprint will be output
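The deduplication behaviour described above can be approximated in a few lines of Python. This is a simplified sketch of what -rd does according to the description here (one record per modulus, an "occurrence" count, merged source lists), not the Java tool's actual implementation.

```python
import json

def remove_duplicates(lines):
    """Keep one record per modulus, counting duplicates and merging sources."""
    unique = {}
    for line in lines:
        key = json.loads(line)
        rec = unique.setdefault(key["n"], {"n": key["n"], "e": key["e"],
                                           "source": [], "occurrence": 0})
        rec["occurrence"] += 1
        rec["source"].extend(key.get("source", []))
    return [json.dumps(rec) for rec in unique.values()]

# Toy input: the first two records share a modulus.
lines = [
    '{"n":"0xa1","e":"0x10001","source":["s1"]}',
    '{"n":"0xa1","e":"0x10001","source":["s2"]}',
    '{"n":"0xb2","e":"0x3","source":["s3"]}',
]
deduped = remove_duplicates(lines)
```

On the toy input, the two records sharing modulus 0xa1 collapse into one record with occurrence 2 and both sources collected.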

Default output - usage help

RSAKeyAnalysis tool, CRoCS 2017
Options:
  -h                   Show this help.
  -c   OPTIONS         Classify keys from key set.
                        OPTIONS = -t table -i in... -o outdir -b batch -p prior -e export -m temp 
                         -t table  = path to classification table file
                         -i in...  = path(s) to data set(s)
                         -o outdir = path to folder for storing results
                         -b batch  = source|primes|modulus_hash|none = how to batch keys
                         -p prior  = estimate|uniform|table = prior probability
                         -e export = none|json|csv = annotated dataset export format
                         -m temp   = none|disk|memory = temporary memory handling - only for export
  -i   table           Load classification table and show information about it.
                        table = path to classification table file
  -m   make  out       Build classification table from makefile.
                        make  = path to makefile
                        out   = path to json file (classification table file)
  -mc -t table         Compute misclassification rate.
                         -t table = path to classification table file
  -cs  OPTIONS         Compute classification success.
                        OPTIONS = -t table -o outdir -c keys [-s seed]
                         -t table  = path to classification table file
                         -o outdir = path to folder for storing results
                         -c keys   = number of keys in simulations
                         -s seed   = optional seed for RNG
  -er  table out       Export raw table (used to generate dendrogram).
                        table = path to classification table file
                        out   = path to csv file
  -ed  table out       Create table showing euclidean distance of sources.
                        table = path to classification table file
                        out   = path to html file
  -ec  table out       Convert classification table to csv format.
                        table = path to classification table file
                        out   = path to csv file
  -um -i in... -o out  Print only keys with first occurrence of modulus|n.
  -uf -i in... -o out  Print only keys with first occurrence of fprint.
  -rd -i in... -o out  Remove duplicities from key set (collecting attributes).
                        -i in...   = paths to key sets (processed individually)
                        -o outdir  = directory for unique datasets
  -ps OPTIONS          Partially sort the dataset to make duplicity removal easier.
                        OPTIONS = -i in... -o outdir -tmp temp -c prefix_bits 
                        -i   in... = paths to datasets
                        -o   out   = path to output directory for presorted dataset
                        -tmp temp  = directory for temporary files
                        -c   bits  = number of prefix bits (makes 2^b temporary files)
  -debug               Show debug and deprecated options.
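The -ps presorting idea (distributing a huge dataset into 2^b buckets by the top bits of the modulus, so that duplicate moduli land in the same, much smaller bucket) can be sketched as follows. Buckets are kept as in-memory lists here for brevity, whereas the tool writes 2^b temporary files; names and moduli are illustrative.

```python
import json

def bucket_index(n_hex: str, bits: int) -> int:
    """Index a modulus by its top `bits` bits (always < 2**bits)."""
    n = int(n_hex, 16)
    return n >> max(n.bit_length() - bits, 0)

def presort(lines, bits=2):
    """Distribute key records into 2**bits buckets by modulus prefix."""
    buckets = [[] for _ in range(1 << bits)]
    for line in lines:
        key = json.loads(line)
        buckets[bucket_index(key["n"], bits)].append(line)
    return buckets

# Toy input: 0xf1 and 0xf2 share their top two bits, 0x81 does not.
lines = [
    '{"n":"0xf1","e":"0x10001","source":[]}',
    '{"n":"0x81","e":"0x10001","source":[]}',
    '{"n":"0xf2","e":"0x10001","source":[]}',
]
buckets = presort(lines, bits=2)
```

Duplicate moduli necessarily share every prefix, so each bucket can then be deduplicated independently without holding the whole dataset in memory.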