Code for cracking passwords with neural networks

Paper
-----

Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks.
W. Melicher, Blase Ur, Sean M. Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor. USENIX Security 2016.
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/melicher


TODO
----

- Update code to newer versions of keras and Theano.

- Refactor code to split into separate files.

- Change to YAML for configuration files to support comments and reduce the
  number of configuration writing errors.

- Remove support for things that are no longer used or keras no longer
  supports (e.g., bidirectional models, JZS1).

- Make live demo of JavaScript guesser.

- Improve testing on the JavaScript guesser.

- Improve documentation about versions and check compatibility with previous
  versions of keras and Theano.

- Improve state of saving data in the intermediate_sqlite file. It's easy to end
  up with data in the intermediate files that doesn't match the training data
  and leads to obscure and sometimes silent errors.

- Improve performance for enumerating guesses.


Bugs
----

This is software used and maintained by students for a research project and
likely will have many bugs and issues.


Setup using Docker
------------------

Make sure you have installed the NVIDIA driver (https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#how-do-i-install-the-nvidia-driver) and Docker (https://docs.docker.com/install). For GPU support, additionally install nvidia-docker (https://github.com/NVIDIA/nvidia-docker).

Build a CPU-only container and start an interactive bash session within it:

    ./deploy.py build-cpu
    ./deploy.py run-cpu

Build a GPU-supported container and start an interactive bash session within it:

    ./deploy.py build-gpu
    ./deploy.py run-gpu

Note: You may need to specify python3 when executing python scripts within the Docker container, e.g. `python3 pwd_guess_unit.py`.


Setup (Manual)
--------------

  Requirements:

  + python - Version 3.4.2 was used during development. Should work with any
    version of python3

  + python packages:

    - theano - Theano requires the version from github instead of the version on
      pip.
      https://github.com/Theano/Theano. To setup the GPU, make sure that you
      read the documentation. Make the .theanorc file in your home directory
      with this:

        [cuda]
        root = /usr/local/cuda

        [global]
        device = gpu                  # change this to be gpu# if necessary
        floatX = float32
        warn_float64 = ignore

      Using the GPU will require that you have nvidia drivers installed and
      CUDA.

      Make sure that gcc is compatible with nvcc. At the time of writing, gcc
      version 4.9 is required. You can check this by executing:

      `which gcc` --version

      Theano 0.7.0-0.8.2 was used during development.

      If using 0.8.2, you may need to add the following lines to your .theanorc
      due to https://github.com/Theano/Theano/pull/4369. If you don't, you might
      get errors like "WARNING (theano.sandbox.cuda): CUDA is installed, but
      device gpu is not available (error: cuda unavailable)":

        [nvcc]
        flags = -D_FORCE_INLINES

    - keras - at time of writing, the version on pip is not current and will
      cause model saving to fail. Use the version from github instead
      (https://github.com/fchollet/keras). Version 0.2.0 was the main keras
      version during development. However, during development, keras changed to
      version 0.3.1. Some commits support one version or the other. It is
      currently a todo item to improve the state of keras support. On my
      machine the current commit's tests pass with Keras commit
      1e58b895236f6a80f5e07de74af25f16d9cc4625.

    - scikit-learn - pip install scikit-learn

    - sqlitedict - pip install sqlitedict

    - numpy

    - cython

  Compiling:

    python3 setup.py build_ext --inplace

  Set up:

    - Cuda must be in path and library path. Add these two lines to your .bashrc
      file:

      export PATH="$PATH":/usr/local/cuda-7.5/bin
      export LD_LIBRARY_PATH="$LD_LIBRARY_PATH":/usr/local/cuda-7.5/lib64

Tests
-----

Run all automated tests with:

    python pwd_guess_unit.py

Running all tests takes roughly 15 minutes on my machine. It may take longer
depending on the GPU you are using.

To run only specific tests:

    python -m unittest pwd_guess_unit.<specific unit test>


Help
----

    python3 pwd_guess.py --help


usage: pwd_guess.py [-h] [--pwd-file PWD_FILE [PWD_FILE ...]]
                    [--arch-file ARCH_FILE] [--weight-file WEIGHT_FILE]
                    [--pwd-format {trie,tsv,list,im_trie} [{trie,tsv,list,im_trie} ...]]
                    [--enumerate-ofile ENUMERATE_OFILE] [--retrain]
                    [--config CONFIG] [--args ARGS] [--profile PROFILE]
                    [--log-file LOG_FILE]
                    [--log-level {debug,info,warning,error}] [--version]
                    [--pre-processing-only] [--stats-only]
                    [--config-args CONFIG_ARGS]
                    [--forked {guesser,random_walker}]
                    [--calc-probability-only] [--train-secondary-only]

Neural Network with passwords. This program uses a neural network to guess
passwords. This happens in two phases, training and enumeration. Either --pwd-
file or --enumerate-ofile are required. --pwd-file will give a password file
as training data. --enumerate-ofile will guess passwords based on an existing
model. Version <version number>

optional arguments:
  -h, --help            show this help message and exit
  --pwd-file PWD_FILE [PWD_FILE ...]
                        Input file name.
  --arch-file ARCH_FILE
                        Output file for the model architecture.
  --weight-file WEIGHT_FILE
                        Output file for the weights of the model.
  --pwd-format {trie,tsv,list,im_trie} [{trie,tsv,list,im_trie} ...]
                        Format of pwd-file input. "list" format is onepassword
                        per line. "tsv" format is tab separated values: first
                        column is the password, second is the frequency in
                        floating hex. "trie" is a custom binary format created
                        by another step of this tool.
  --enumerate-ofile ENUMERATE_OFILE
                        Enumerate guesses output file
  --retrain             Instead of training a new model, begin training the
                        model in the weight-file and arch-file arguments.
  --config CONFIG       Config file in json.
  --args ARGS           Argument file in json.
  --profile PROFILE     Profile execution and save to the given file.
  --log-file LOG_FILE
  --log-level {debug,info,warning,error}
  --version             Print version number and exit
  --pre-processing-only
                        Only perform the preprocessing step.
  --stats-only          Quit after reading in passwords and saving stats.
  --config-args CONFIG_ARGS
                        File with both configuration and arguments.
  --forked {guesser,random_walker}
                        Internal use only.
  --calc-probability-only
                        Only output password probabilities
  --train-secondary-only
                        Only train on secondary data.


Pretrained Network Usage
------------------------

Enumerating passwords

Edit guess_len8_config.json to replace "g1_len8.tsv" in the "enumerate_ofile"
key with the output file you would like.

If you want to guess more passwords, you should change the value of
"lower_probability_threshold" to something lower, e.g. 1e-8.

Passwords are not sorted, so if you want in-order guessing, sort the
output file by descending probability:

sort -gr -k2 -t$'\t' [OUTPUT_FILE] -o [SORTED_OUTPUT_FILE]
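
Where the sort utility is not available, the same ordering can be sketched in
Python. This is a minimal sketch, assuming the two-field TSV layout (password,
probability) described for the 'human' output format. The function name is
hypothetical and is not part of this repository.

```python
import io


def sort_guesses_by_probability(lines):
    """Return (password, probability) pairs sorted by descending probability."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the probability is always the final field.
        password, probability = line.rsplit("\t", 1)
        rows.append((password, float(probability)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Made-up sample data standing in for an enumeration output file.
sample = io.StringIO("password1\t1e-05\n12345678\t0.002\niloveyou\t3e-04\n")
for password, probability in sort_guesses_by_probability(sample):
    print(password, probability)
```

For large output files the sort command above is likely the better choice,
since this sketch holds every row in memory.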


Monte Carlo Simulation

Edit guess_len8_config.json to replace "g1_len8.tsv" in the "enumerate_ofile"
key with the output file you would like. Edit "<input_file>" in the
"password_test_fname" key to set the password input file. This file should
point to a line-delimited password file where each line is one password.


Command:

python3 <path_to_root>/pwd_guess.py --config-args <config_file.json>

e.g.:

python3 ../pwd_guess.py --config-args guess_len8_config.json



Version
-------

    python pwd_guess.py --version


Output format
-------------

delamico_random_walk - This output format performs a Monte Carlo estimation of
the guess number, the strength of a password. The output file is a TSV where
each line has 7 fields: the password, the probability of that password, the
estimated output guess number (the strength of the password), the std deviation
of the randomized trial for this password (in units of number of guesses), the
number of measurements for this password, the estimated confidence interval for
the guess number (in units of number of guesses).

human - This output format enumerates guesses and stores the list of passwords
guessed to the output file. The guesses are not in order of probability. The
output file is a TSV with each line having two fields: the password, and the
probability. You can sort the passwords by probability using the unix sort
command.

calculator - This output format calculates the exact number of guesses for a
test set of passwords by enumerating guesses. The output file is a TSV with 3
fields: the password, the probability for that password, and the guess number.

generate_random - This output format generates random passwords and stores them to
disk. The output is a TSV with 2 fields: the random password and its probability.



Config files
------------

Configuration information for guessing and training. Can be read from a file
in json format.

# Files Configuration Options:

intermediate_fname - File name to store intermediate information about
  processing relative to the current directory. A value of ':memory:' will
  store all values in memory. Default is ':memory:'. This is necessary if
  enumeration and training happen at different times.


# Neural Network Model Configuration Options:

char_bag - alphabet of characters over which to guess. By default this includes
  all keyboard keys (e.g., alphanumeric characters and some symbols).

model_type - type of model. Should be LSTM or GRU or JZS{1,2,3} (JZS1,2,3 are
  only supported in earlier versions of the Keras library).

hidden_size - Size of each hidden recurrent layer.

dense_layers - Number of additional dense layers.

dense_hidden_size - Size of dense layer.

layers - Number of hidden layers.

max_len - Maximum length of any password in training data. This can be
  larger than all passwords in the data and the network may output guesses
  that are this many characters long.

min_len - Minimum length of any password that will be guessed.

model_optimizer - Model optimizer. Default is 'adam'. Read about optimizer
  values from the Keras documentation: http://keras.io/optimizers/.

context_length - Number of context characters to use. Lower means less time to
  train, more could potentially increase accuracy.

generations - More generations means it takes longer but is more accurate.
  Default is 20.

dropouts - Use neural network drop out weights. If true, can prevent
  overfitting.

dropout_ratio - Ratio of dropouts.

train_backwards - If true, train on passwords backwards: e.g., guessing d from
  'rowssap' instead of guessing d from 'passwor'.
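
To make context_length and train_backwards concrete, here is a minimal
sketch of how a (context, target) training pair could be formed. It follows
the 'passwor' / 'rowssap' example above. The helper name and the way the two
options are combined are assumptions for illustration, not this repository's
code.

```python
def context_and_target(password, position, context_length, train_backwards=False):
    """Return (context, target) for predicting password[position]."""
    # Keep at most context_length characters that precede the target.
    start = max(0, position - context_length)
    context = password[start:position]
    if train_backwards:
        # Present the same context characters in reverse order.
        context = context[::-1]
    return context, password[position]


print(context_and_target("password", 7, 10))
print(context_and_target("password", 7, 10, train_backwards=True))
print(context_and_target("password", 7, 3))
```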

bidirectional_rnn - Only supported for some versions of Keras. If true, then
  use a Bidirectional version of the neural network model.

deep_model - If true, then train a deeper NN model. Set this to true if you
  use more than one layer in the 'layers' argument.

padding_character - If true, then use a padding character. This should
  generally be false, but is included for backward compatibility. Models
  trained before version 275 include a padding character.


# Training Configuration Options:

freq_format - can be 'hex' or 'decimal'. This defines the format of frequency
  integers in the training sets. Only applicable when using TSV format for
  input.

secondary_training - If true, use a secondary training set after the primary
  training set.

secondary_train_sets - Json dictionary in this format:

        "secondary_train_sets" : {
            "pwd_file" : [
                "<pwd_file>"
            ],
            "pwd_format" : [
                "list"
            ]
        }
    pwd_file is a list of files. pwd_format is a list of formats corresponding
    to each file. Accepts the same options as the --pwd-format argument.

freeze_feature_layers_during_secondary_training - If true, then during
  secondary training, the feature layers will be frozen. This is useful for
  avoiding overfitting to the secondary training set, especially if the
  secondary training set is significantly smaller than the primary set.

secondary_training_save_freqs - If true, then use the secondary training set
  for post-processing frequencies instead of the primary set.

training_chunk - A smaller training chunk consumes less GPU memory; a larger
  one consumes more. Ideally, this value would be as large as possible without
  running out of memory on the GPU. Large values could in principle lower the
  quality of training, but I have not observed this to happen in practice.

chunk_print_interval - Interval over which to print info to the log.

train_test_ratio - Ratio of training data to holdout testing data. A value of
  20 means using one out of every 20 passwords for holdout testing. These
  passwords are only used to print accuracy statistics in the log data and for
  early-quit statistics. The logged accuracy statistics are only for diagnostic
  and debugging purposes and should not be used in a real test. To perform a
  real test, you should not give any test-passwords during training.
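
As an illustration of what a train_test_ratio of 20 means, the sketch below
holds out one password in every 20. Which passwords end up in the holdout set
is an assumption here (every 20th by position); the repository may choose
them differently, and the function name is hypothetical.

```python
def split_holdout(passwords, train_test_ratio=20):
    """Hold out one password out of every train_test_ratio for testing."""
    training, holdout = [], []
    for index, password in enumerate(passwords):
        if index % train_test_ratio == 0:
            holdout.append(password)
        else:
            training.append(password)
    return training, holdout


# Made-up passwords, only to show the resulting set sizes.
passwords = ["pw{}".format(i) for i in range(100)]
training, holdout = split_holdout(passwords, train_test_ratio=20)
print(len(training), len(holdout))
```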

training_accuracy_threshold - If the accuracy is not improving by this
  amount each generation, then quit. Set to -1 to never quit early.

rare_character_optimization - Default false. If you specify a list of
  characters to treat as rare, then it will model those characters with a
  rare character. This will increase performance at the expense of accuracy.

rare_character_lowest_threshold - Default 20. The characters with the lowest
  frequency in the training data will be modeled as special characters. This
  number indicates how many to drop. A value of 20 means treating the 20 least
  frequent characters in the training set as rare characters.
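
A minimal sketch of picking the least frequent characters in a training set,
as rare_character_lowest_threshold describes. The function name is
hypothetical, and how ties between equally rare characters are broken is an
assumption of this sketch.

```python
from collections import Counter


def rare_characters(passwords, lowest_threshold=20):
    """Return the lowest_threshold least frequent characters."""
    if lowest_threshold <= 0:
        return []
    counts = Counter(char for password in passwords for char in password)
    # most_common() orders by descending count, so the rarest are at the end.
    by_frequency = counts.most_common()
    return [char for char, _ in by_frequency[-lowest_threshold:]]


print(rare_characters(["aaa", "aab", "abc!"], lowest_threshold=2))
```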

uppercase_character_optimization - Default false. If true, uppercase
  characters will be treated the same as lower case characters. Uppercase
  characters will be predicted via post-processing output according to the
  frequency of uppercase characters in the training data.

no_end_word_cache - When rare_character_optimization or
  uppercase_character_optimization is used, it uses different post-processing
  percents for the first and last character. If no_end_word_cache is true, then
  only the first character has different post-processing values. The intuition
  for this is that uppercase characters are likely more probable as the first
  character and special characters more likely as the last character.

simulated_frequency_optimization - Default false. Only for TSV files. If set
  to true, then multiple instances of the same password are simulated. This
  can improve performance at the expense of accuracy.

save_always - Boolean. Default true. If false, then only the networks which
  perform best on verification data will be saved to disk.

save_model_versioned - Boolean. When saving the model, save each generation of
  the model using a different file name. You can use this to measure the effect
  of more generations on models. The first generation is saved as
  <model_file>.1, the second generation is saved in the file <model_file>.2,
  where <model_file> is the model file name given in the arguments.
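
The naming scheme above can be sketched as follows, which is handy when
looping over saved generations to compare them. The helper name is
hypothetical.

```python
def versioned_model_files(model_file, generations):
    """List the per-generation file names: <model_file>.1, <model_file>.2, ..."""
    return ["{}.{}".format(model_file, generation)
            for generation in range(1, generations + 1)]


print(versioned_model_files("weight.h5", 3))
```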

randomize_training_order - If true, will randomize the passwords training
  order.

compute_stats - Compute pre-processing step and exit without training a neural
  network.

tokenize_words - If true, create a tokenized model.

most_common_token_count - If tokenize_words is true, then this is the number of
  tokens to simulate. E.g., 2000 will simulate the most common 2000 tokens in
  the training set.



# Guessing Configuration Options:

lower_probability_threshold - This controls how many passwords to output
  during generation. Lower threshold means more passwords. A value of 1e-7 will
  output all passwords with probability above 1e-7.

relevel_not_matching_passwords - If true, then passwords that do not match the
  filter policy will have their probability equal to zero and that probability
  will be redistributed to other passwords. Recommended true.

guess_serialization_method - Default is 'human' which enumerates all passwords
  above the lower_probability_threshold cutoff. 'delamico_random_walk' means
  calculate password guess numbers using Monte Carlo simulations.
  'generate_random' means generate random passwords. 'calculator' enumerates
  all passwords, but does not save the enumerated passwords to disk; instead it
  calculates the guess number of the test set of passwords.

parallel_guessing - Boolean. If true, then use multiple cores to generate
  passwords.

fork_length - The prefix length to fork on when parallel_guessing is true. If
  this value is 2, then prefixes of length 2 will be assigned to different
  cores. For example, one core will generate passwords that start with 'aa',
  another with 'ab', etc.
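
To illustrate fork_length, the sketch below lists every prefix of a given
length over a small alphabet and spreads the prefixes over workers. The
round-robin assignment and both function names are assumptions made for
illustration; the repository may schedule prefixes differently.

```python
import itertools


def fork_prefixes(char_bag, fork_length):
    """Enumerate every prefix of fork_length characters from char_bag."""
    return ["".join(chars)
            for chars in itertools.product(char_bag, repeat=fork_length)]


def assign_round_robin(prefixes, num_workers):
    """Spread prefixes over workers in turn (assumed scheduling)."""
    buckets = [[] for _ in range(num_workers)]
    for index, prefix in enumerate(prefixes):
        buckets[index % num_workers].append(prefix)
    return buckets


prefixes = fork_prefixes("ab", 2)
print(prefixes)
print(assign_round_robin(prefixes, 2))
```

Note that the number of prefixes grows as len(char_bag) ** fork_length, so
small values of fork_length already produce many work units.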

guesser_intermediate_directory - Directory to store intermediate files used in
  parallel guessing.

cleanup_guesser_files - If true, then delete files in the
  guesser_intermediate_directory after completion.

password_test_fname - File name containing test passwords. Each password should
  be on one line.

chunk_size_guesser - Number of passwords to send to the GPU in one chunk. More
  increases performance but could run out of GPU memory.

max_gpu_prediction_size - Maximum number of password fragments to send to the
  GPU in one chunk. More increases performance but could run out of GPU memory.

gpu_fork_bias - Ratio to decrease the chunk size when using multiple processes.
  Parallel guessing takes up more fixed memory on the GPU so can lead to
  running out of GPU memory more easily. This value controls how much to
  decrease memory by when forking.

cpu_limit - Number of processes to fork when using parallel guessing.

tokenize_guessing - If true, and if tokenize_words is true, then perform
  tokenization during guessing.

probability_striation - If non-zero, then instead of enumerating probabilities
  for specific passwords, instead enumerate the guess numbers at certain
  probability cutoffs. This is useful for exporting a pre-computation of
  probability to guess number mapping.

prob_striation_step - If probability_striation is non-zero, then it will
  calculate guess numbers for 10^(-j * prob_striation_step) for j in
  1..probability_striation. So for example, for prob_striation_step = 1 and
  probability_striation = 10, it would calculate the guess number at the
  following probabilities: 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8,
  1e-9, 1e-10.
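
The probability cutoffs in the example above can be reproduced with a short
sketch. It uses negative exponents so that the result matches the listed
values 1e-1 through 1e-10. The function name is hypothetical.

```python
def striation_cutoffs(probability_striation, prob_striation_step):
    """Probability cutoffs 10^(-j * step) for j in 1..probability_striation."""
    return [10.0 ** (-j * prob_striation_step)
            for j in range(1, probability_striation + 1)]


for cutoff in striation_cutoffs(probability_striation=10, prob_striation_step=1):
    print(cutoff)
```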

enforced_policy - Will not generate guesses that do not match the policy.
  Currently supported policies are:

    'complex' - requires 8 characters and 4 classes.
    'basic' - no requirements
    '1class8' - requires 8 characters
    'basic_long' - requires 16 characters
    'complex_lowercase' - requires 8 characters and 3 character classes
                          insensitive to case.
    'complex_long' - requires 16 characters and 3 character classes
    'complex_long_lowercase' - requires 16 characters and 2 character classes
                               insensitive to case.
    'semi_complex' - requires 12 characters and 3 character classes
    'semi_complex_lowercase' - requires 12 characters and 2 character classes
                               insensitive to case.
    '3class12' - Same as semi_complex
    '2class12_all_lowercase' - Same as semi_complex_lowercase
    'one_uppercase' - Requires at least one uppercase character

  *_lowercase policies mean that they are insensitive to case and case is
  ignored. These are useful when preparing a train set using the
  policyfilterer.py utility, but not useful for training or guessing with a
  neural network.
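
A minimal sketch of checking a password against some of the policies listed
above. The length and class counts are copied from that list. The four
character classes used here (lowercase, uppercase, digit, symbol) and the
function names are assumptions, and the *_lowercase and one_uppercase
policies are left out.

```python
# (minimum length, minimum number of character classes) per policy.
POLICIES = {
    "basic": (0, 0),
    "1class8": (8, 1),
    "basic_long": (16, 1),
    "complex": (8, 4),
    "complex_long": (16, 3),
    "semi_complex": (12, 3),
    "3class12": (12, 3),
}


def count_character_classes(password):
    """Count how many of the four assumed classes appear in the password."""
    classes = set()
    for char in password:
        if char.islower():
            classes.add("lowercase")
        elif char.isupper():
            classes.add("uppercase")
        elif char.isdigit():
            classes.add("digit")
        else:
            classes.add("symbol")
    return len(classes)


def matches_policy(password, policy):
    """True if the password meets the length and class requirements."""
    min_length, min_classes = POLICIES[policy]
    return (len(password) >= min_length
            and count_character_classes(password) >= min_classes)


print(matches_policy("Passw0rd!", "complex"))
print(matches_policy("password", "complex"))
```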



# Monte Carlo Methods Configuration Options:

random_walk_seed_num - Number of passwords to keep in main memory in one chunk.
  More increases memory requirements.

random_walk_confidence_bound_z_value - confidence bound coefficient. This
  should correspond to the coefficient for a confidence interval. E.g., 95%
  means a value of 1.96, 99% means a value of 2.58
  [https://en.wikipedia.org/wiki/Confidence_interval]. Default is 1.96.

random_walk_confidence_percent - Confidence percent for the random_walk
  guesser. A value of 5 will mean that the simulation will continue until all
  passwords have confidence interval less than 5% of the estimated guess
  number.
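
The interaction between the z value and the confidence percent can be
sketched as below. It assumes the usual normal-approximation interval,
z * std_dev / sqrt(n), which may not be exactly what the guesser computes.
Both function names are hypothetical.

```python
import math


def confidence_interval(std_dev, num_measurements, z_value=1.96):
    """Half-width of a normal-approximation confidence interval (assumed form)."""
    return z_value * std_dev / math.sqrt(num_measurements)


def is_converged(estimate, std_dev, num_measurements,
                 z_value=1.96, confidence_percent=5):
    """True once the interval is below confidence_percent of the estimate."""
    interval = confidence_interval(std_dev, num_measurements, z_value)
    return interval < estimate * confidence_percent / 100.0


# Made-up numbers: estimated guess number, std deviation, and measurements.
print(confidence_interval(100.0, 100))
print(is_converged(1000.0, 100.0, 100))
print(is_converged(100.0, 100.0, 100))
```

Under this assumption, halving the interval takes roughly four times as many
measurements, which is why random_walk_upper_bound is useful as a cap.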

random_walk_upper_bound - Upper bound on the number of rounds to continue
  simulation.

pwd_list_weights - Weighting to give different training sets. This should be a
  json dictionary mapping file names to a ratio:

  "pwd_list_weights" : {
                     "file1" : 1,
                     "file2" : 2
  }

  This will weight passwords in file1 as being twice as important as file2.


# Deprecated Configuration Options related to Trie preprocessing. Don't use these:

trie_serializer_encoding - default is 'utf8'.

trie_serializer_type - 'reg' or 'fuzzy'.

trie_implementation - Trie implementation. 'trie' for custom
  implementation. None for no trie optimization.

trie_fname - File name for storing trie.

trie_intermediate_storage - File for storing intermediate trie.

preprocess_trie_on_disk

preprocess_trie_on_disk_buff_size

toc_chunk_size

use_mmap

fuzzy_training_smoothing

scheduled_sampling

final_schedule_ratio



Example Configuration File
--------------------------

You can also see the pre_built_networks/ directory for examples of
configuration files. Here are some starting configuration files that you should
modify to suit your needs.

Combined arguments and configuration file for generic training.

{
    "args" : {
        "arch_file" : "arch.json",
        "weight_file" : "weight.h5",
        "log_file" : "train_log.txt",
        "pwd_file" : [
            "[INPUT_FILE]"
        ],
        "pwd_format" : [
            "list"
        ]
    },
    "config" : {
        "training_chunk" : 1000,
        "training_main_memory_chunk": 10000000,
        "min_len" : 8,
        "max_len" : 30,
        "context_length" : 10,
        "chunk_print_interval" : 100,
        "layers" : 2,
        "hidden_size" : 1000,
        "generations" : 5,
        "training_accuracy_threshold" : -1,
        "train_test_ratio" : 20,
        "model_type" : "LSTM",
        "train_backwards" : true,
        "dense_layers" : 1,
        "dense_hidden_size" : 512,
        "secondary_training" : true,
        "secondary_train_sets" : {
            "pwd_file" : [
                "[SECONDARY_INPUT_OPTIONAL]"
            ],
            "pwd_format" : [
                "list"
            ]
        },

        "simulated_frequency_optimization" : false,
        "randomize_training_order" : true,
        "uppercase_character_optimization" : true,
        "rare_character_optimization" : true,
        "rare_character_optimization_guessing" : true,
        "parallel_guessing" : false,
        "chunk_size_guesser" : 40000,
        "random_walk_seed_num" : 100000,
        "max_gpu_prediction_size" : 10000,
        "random_walk_seed_iterations" : 1,
        "no_end_word_cache" : true,
        "intermediate_fname" : "intermediate_data.sqlite",
        "save_model_versioned" : true
    }
}
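
Because these files are plain JSON, comments are not allowed and a stray
comma makes the whole file unreadable. A quick check before starting a long
run can be sketched as follows. This is not the repository's own parsing
code; the function name and the file name "passwords.txt" are hypothetical.

```python
import json


def load_config_args(text):
    """Split a combined file into its "args" and "config" dictionaries."""
    combined = json.loads(text)
    args = combined.get("args", {})
    config = combined.get("config", {})
    # The help text says either --pwd-file or --enumerate-ofile is required.
    if "pwd_file" not in args and "enumerate_ofile" not in args:
        raise ValueError("need pwd_file (training) or enumerate_ofile (guessing)")
    return args, config


example = """
{
    "args": {"arch_file": "arch.json", "weight_file": "weight.h5",
             "pwd_file": ["passwords.txt"], "pwd_format": ["list"]},
    "config": {"min_len": 8, "max_len": 30, "model_type": "LSTM"}
}
"""
args, config = load_config_args(example)
print(sorted(args))
print(config["model_type"])
```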


Example config of enumerating passwords:

{
    "args" : {
        "arch_file" : "arch.json",
        "weight_file" : "nn_len8.h5",
        "log_file" : "guess_log.txt",
        "enumerate_ofile" : "g1_enumerate.tsv"
    },
    "config" : {
        "training_chunk" : 10000,
        "min_len" : 8,
        "max_len" : 30,
        "context_length" : 10,
        "chunk_print_interval" : 100,
        "layers" : 2,
        "hidden_size" : 1000,
        "model_type" : "JZS2",
        "simulated_frequency_optimization" : true,
        "intermediate_fname" : "intermediate_data.sqlite",
        "randomize_training_order" : true,
        "uppercase_character_optimization" : true,
        "rare_character_optimization" : true,
        "rare_character_optimization_guessing" : true,
        "parallel_guessing" : false,
        "lower_probability_threshold" : 1e-6,
        "padding_character" : true,
        "chunk_size_guesser" : 20000,
        "guess_serialization_method" : "human",
        "random_walk_seed_num" : 100000,
        "max_gpu_prediction_size" : 20000,
        "random_walk_seed_iterations" : 1,
        "no_end_word_cache" : true
    }
}



Combined arguments and configuration file for guessing using Monte Carlo
simulations:

{
  "args" : {
    "arch_file" : "arch.json",
    "weight_file" : "all_trained.h5.3",
    "log_file" : "guess_log.txt",
    "enumerate_ofile": "g3_long.tsv"
  },
  "config" : {
    "training_chunk" : 1000,
    "training_main_memory_chunk": 10000000,
    "min_len" : 16,
    "max_len" : 30,
    "context_length" : 10,
    "chunk_print_interval" : 100,
    "layers" : 2,
    "hidden_size" : 1000,
    "generations" : 3,
    "training_accuracy_threshold" : -1,
    "train_test_ratio" : 20,
    "model_type" : "JZS2",
    "tokenize_words" : false,
    "most_common_token_count" : 2000,

    "bidirectional_rnn" : false,
    "train_backwards" : true,

    "dense_layers" : 1,
    "dense_hidden_size" : 512,
    "secondary_training" : true,
    "secondary_train_sets" : {
      "pwd_file" : [
        "../leaks/all_combined_long_v2.txt"
      ],
      "pwd_format" : [
        "list"
      ]
    },

    "simulated_frequency_optimization" : false,

    "randomize_training_order" : true,
    "uppercase_character_optimization" : true,
    "rare_character_optimization" : true,

    "rare_character_optimization_guessing" : true,
    "parallel_guessing" : false,
    "lower_probability_threshold" : 1e-7,
    "chunk_size_guesser" : 40000,
    "guess_serialization_method" : "delamico_random_walk",
    "password_test_fname" : "../leaks/basic16.txt",
    "random_walk_seed_num" : 100000,
    "max_gpu_prediction_size" : 10000,
    "random_walk_seed_iterations" : 50,
    "no_end_word_cache" : true,
    "intermediate_fname" : "intermediate_data.sqlite",
    "save_model_versioned" : true
  }
}


Example guessing configuration for a complex policy.

{
    "args" : {
        "arch_file" : "arch.json",
        "weight_file" : "all_trained_cmplx.h5.3",
        "log_file" : "guess_log.txt",
        "enumerate_ofile": "g1_complex.tsv"
    },
    "config" : {
        "training_chunk" : 1000,
        "training_main_memory_chunk": 10000000,
        "min_len" : 8,
        "max_len" : 30,
        "context_length" : 10,
        "chunk_print_interval" : 100,
        "layers" : 2,
        "hidden_size" : 1000,
        "generations" : 3,
        "training_accuracy_threshold" : -1,
        "train_test_ratio" : 20,
        "model_type" : "JZS2",
        "tokenize_words" : false,
        "most_common_token_count" : 2000,
        "enforced_policy" : "complex",

        "bidirectional_rnn" : false,
        "train_backwards" : true,

        "dense_layers" : 1,
        "dense_hidden_size" : 512,
        "secondary_training" : true,
        "secondary_train_sets" : {
            "pwd_file" : [
                "../leaks/all_combined_long_v2.txt"
            ],
            "pwd_format" : [
                "list"
            ]
        },

        "simulated_frequency_optimization" : false,

        "randomize_training_order" : true,
        "uppercase_character_optimization" : true,
        "rare_character_optimization" : true,

        "rare_character_optimization_guessing" : true,
        "parallel_guessing" : false,
        "lower_probability_threshold" : 1e-7,
        "chunk_size_guesser" : 40000,
        "guess_serialization_method" : "delamico_random_walk",
        "password_test_fname" : "../leaks/complex/andrew8.txt",
        "random_walk_seed_num" : 100000,
        "max_gpu_prediction_size" : 10000,
        "random_walk_seed_iterations" : 1,
        "no_end_word_cache" : true,
        "intermediate_fname" : "intermediate_data.sqlite",
        "save_model_versioned" : true
    }
}