bmc/peoplegen

Random People Generator

This package is a simple Scala-based command-line tool to create fake people records, using first, middle, and last names taken at random from United States Census Bureau data that's captured in local files. By default, it splits the generated names so that half are female and half are male, but that can be changed via command line options.

(Yes, I know there are more than two genders. I support that distinction. The Census Bureau data files I'm using are from 2010, and they only supported two genders. For now, this program is consistent with that restriction, though I'm considering ways to expand it to generate data that's more reflective of gender reality.)

The tool can generate CSV or JSON output.

As is probably obvious, I use this program to generate test data.


WARNING: I built this tool for myself. You're welcome to use it, read it, comment on it, etc. However, do not expect me to maintain this tool rigorously. It's a playground for me, as well as something I use occasionally. That's all.


I also built a Rust version of this thing.

Installation

Clone this repo in the usual way. Then, read on.

peoplegen is built with SBT. You'll need to have SBT installed to proceed. See https://www.scala-sbt.org/download.html for details.

sbt install

will build a fat jar and install it in $HOME/local/libexec, by default. It'll then install a wrapper peoplegen script in $HOME/local/bin. You can change the prefix from $HOME/local to something else by setting installDir in build.sbt to a different path. See the commented-out example in build.sbt.

Note for Windows users: I don't run this thing on Windows, so I'm probably not going to go out of my way to support it there. It should work fine, but you're on your own if it doesn't.

Usage

At any time, you can run peoplegen --help for a usage summary. The command line looks like:

Usage: peoplegen [options] <total> [<outputfile>]
  • <total> is the total number of people records to generate.
  • <outputfile> is the file to which to write the records; if not supplied, the output goes to standard output.

Options

peoplegen currently supports the following options:

  • --help: Generate the usage message and exit.

  • -f <percent> or --female <percent>: Percent of female records to generate.

  • -m <percent> or --male <percent>: Percent of male records to generate.

NOTE: If you specify neither male nor female percentages, both default to 50. If you specify only one percentage, the other is set to the remainder. (For example, if you specify only --male 60, the female percentage is set to 40.) If you set both percentages, they must add up to 100, or peoplegen will abort.
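
The defaulting rules in the note above can be sketched in a few lines. This is only an illustration in Python, not the tool's actual Scala code: the function name `resolve_percentages` is hypothetical, and the 0 to 100 range check is an assumption that the note does not state.

```python
def resolve_percentages(female=None, male=None):
    """Return (female, male) percentages, following the rules in the note.

    None means the corresponding option was not given on the command line.
    """
    if female is None and male is None:
        # Neither option given: split evenly.
        return 50, 50
    if female is None:
        # Only --male given: female gets the remainder.
        female = 100 - male
    elif male is None:
        # Only --female given: male gets the remainder.
        male = 100 - female

    # Assumption (not stated in the README): each value must be in 0..100.
    for value in (female, male):
        if not 0 <= value <= 100:
            raise ValueError(f"percentage out of range: {value}")

    # Both given (or one derived): the two must total 100.
    if female + male != 100:
        raise ValueError(f"percentages must add up to 100, got {female + male}")
    return female, male


if __name__ == "__main__":
    print(resolve_percentages())         # neither option given
    print(resolve_percentages(male=60))  # only --male 60
```

In this sketch, passing both values with a total other than 100 raises an error, which mirrors peoplegen aborting in that case.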

  • --id: Generate unique per-row ID values.

  • --ssn: Generate Social Security Number values. Note that the generated SSNs are deliberately invalid, as described at https://stackoverflow.com/a/2313726/53495.

  • --salaries: Generate salary data. Salaries are generated as a normal distribution of integers, around a mean of 72,641 (the U.S. mean salary in 2014), with a sigma (i.e., a spread, or standard deviation) of 20,000. To change these values, use --salary-mean and --salary-sigma.

  • --salary-mean <value>: You can use this option to change the mean salary for the salary distribution. Note: Changing this value can result in negative salaries, so check your final data.

  • --salary-sigma <value>: You can use this option to change the salary generation sigma—the spread, if you prefer. A smaller number means more salaries will cluster around the mean. A larger number means the distribution will be more "spread out". The distribution will still be a normal one (a bell curve), but the mean and the sigma control the shape of the curve.

  • --year-min <value>: Specify the starting year for birth dates. Defaults to 65 years before the current year.

  • --year-max <value>: Specify the ending year for birth dates. Defaults to 18 years before the current year. This value cannot precede the --year-min value.

  • --delim <c>: (CSV only) The delimiter to use between columns. The default is a comma (","). Any single character is fine. For tab, use the 2-character sequence "\t".

  • --header: (CSV only) Generate a header for CSV output. Default: no header.

  • -F <format> or --format <format>: The file format to generate. Allowable values: "csv" or "json".

  • --camel: Use camel case for CSV column names or JSON field names. For example: firstName, lastName

  • --english: Use English (space-separated) names for column names. For example: first name, last name

  • --snake: Use "snake case" (underscores) names for column names. For example: first_name, last_name

  • -j <format> or --json-format <format>: (JSON only) Specify how the JSON should be generated. Legal values:

    • "rows" (default): Generate individual rows of one-line JSON people records. This format is useful with Apache Spark.
    • "array": Generate a JSON array containing the JSON people records, all on one line.
    • "pretty": Generate pretty-printed JSON.
  • -v or --verbose: Emit (some) verbose processing messages.
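
To see why the --salary-mean description warns about negative salaries, it can help to sample from a normal distribution directly. The following is a minimal Python sketch, not the tool's actual Scala implementation: the function names and the `seed` parameter are hypothetical, and rounding to the nearest integer is an assumption about how integer salaries are produced. Only the default mean (72,641) and sigma (20,000) come from the option descriptions above.

```python
import random

DEFAULT_MEAN = 72_641   # default mean quoted for --salaries
DEFAULT_SIGMA = 20_000  # default sigma quoted for --salaries


def generate_salaries(count, mean=DEFAULT_MEAN, sigma=DEFAULT_SIGMA, seed=None):
    """Draw `count` integer salaries from a normal distribution."""
    # `seed` is a hypothetical convenience so runs can be repeated.
    rng = random.Random(seed)
    # Assumption: salaries are rounded to the nearest integer.
    return [round(rng.gauss(mean, sigma)) for _ in range(count)]


def count_negative(salaries):
    """Count how many generated salaries fall below zero."""
    return sum(1 for salary in salaries if salary < 0)


if __name__ == "__main__":
    defaults = generate_salaries(10_000, seed=1)
    low_mean = generate_salaries(10_000, mean=20_000, seed=1)
    print("negative salaries with the defaults:", count_negative(defaults))
    print("negative salaries with mean 20,000:", count_negative(low_mean))
```

With the mean only one sigma above zero, roughly one draw in six should land below zero in a sketch like this, so a quick check such as `count_negative` on the generated data is worthwhile whenever the mean or sigma is changed.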