facebookresearch/synlm

Installation

Dependencies

pip install -r requirements.txt

Datasets

Download the datasets from their source and convert them to JSON Lines (jsonl) format, with one JSON object per line. Each line of the dataset should look like this:

{"age": 25, "height": "1m78", "city": "Castelnaudary"}

Put the dataset in

$DATA_PATH/untokenized/${DATASET}_$SPLIT.jsonl

where $DATA_PATH is a root directory of your choice, $DATASET is the name of the dataset, and $SPLIT is either 'train' or 'valid'.
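As an illustration of the layout described above, here is a minimal sketch that writes records to the expected location. The helper name write_jsonl and its arguments are hypothetical and are not part of this repository; only the path pattern and the one-object-per-line format come from the instructions above.

```python
import json
from pathlib import Path


def write_jsonl(records, data_path, dataset, split):
    """Write one JSON object per line to
    <data_path>/untokenized/<dataset>_<split>.jsonl and return the path.

    Hypothetical helper for illustration; not part of the repository.
    """
    if split not in ("train", "valid"):
        raise ValueError(f"unexpected split: {split!r}")
    out_dir = Path(data_path) / "untokenized"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{dataset}_{split}.jsonl"
    with out_file.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII values readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_file
```

Calling write_jsonl(rows, "/my/data/path", "mydataset", "train") with a list of dictionaries would then produce /my/data/path/untokenized/mydataset_train.jsonl.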

Tokenize the data. The script will produce data under $DATA_PATH/tokenized/ and $DATA_PATH/discretized/.

python tokenization.py --dataset $DATASET

Modify the paths in common/paths.py so that DATA_PATH points to your data root:

# common/paths.py
DATA_PATH = "/my/data/path"

Training a model

The following command trains a language model on the dataset named by $DATASET (for example, the scooter dataset).

python train_lm.py \
--steps 10001 \
--lr 0.0005 \
--warmup_steps 100 \
--num_heads 4 \
--num_layers 4 \
--num_embed 256 \
--method real \
--trie_guided true \
--augmentations 2 \
--physical_batch_size 16 \
--dataset $DATASET \
--permute_fields false \
--tokenizer level \
--vocab_size 10000 \
--num_workers 10 \
--print_freq 10 \
--val_freq 1000 \
--architecture custom \
--batch_size 128
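The command passes both --physical_batch_size 16 and --batch_size 128. This README does not define the two flags. A common convention, assumed here and not confirmed by the source, is that the physical batch is what fits in memory for one forward/backward pass and the larger batch is reached by accumulating gradients over several passes. Under that assumption, the number of accumulation steps can be sketched as follows (the function name is hypothetical):

```python
def accumulation_steps(batch_size, physical_batch_size):
    """Number of physical batches needed per optimizer step.

    Assumes batch_size is the effective batch and physical_batch_size
    is the per-pass micro-batch; this interpretation is an assumption,
    not something stated by the repository.
    """
    if physical_batch_size <= 0:
        raise ValueError("physical_batch_size must be positive")
    if batch_size % physical_batch_size != 0:
        raise ValueError("batch_size must be a multiple of physical_batch_size")
    return batch_size // physical_batch_size


# With the values from the command above: 128 // 16
print(accumulation_steps(128, 16))  # prints 8
```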

License

The majority of SynLM is licensed under CC-BY-NC. However, portions of the project are available under separate license terms: https://github.com/ryan112358/private-pgm/ is licensed under the Apache 2.0 license.

Citation

If you use this code in your research, please cite:

@article{sablayrolles2023privately,
title={Privately generating tabular data using language models},
author={Sablayrolles, Alexandre and Wang, Yue and Karrer, Brian},
journal={arXiv},
year={2023}
}

About

Code for paper: "Privately generating tabular data using language models".
