Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with Crowdsourcing and Active Learning
This GitHub page hosts the artifacts submitted for Artifact Evaluation (AE) at USENIX Security 2023 for the paper Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with Crowdsourcing and Active Learning.
We include three artifacts here: the Calpric Privacy Policy Corpus (CPPS), a customized BERT-based embedding pre-trained on privacy policy text (PriBERT), and example source code for the crowdsourcing and active learning components of the Calpric category model.
The CPPS dataset contains privacy policy segment labels covering 9 data categories (contact, device, location, health, financial, demographic, survey, social media, and personally identifiable information) and 3 data actions (collect/use, share, and store). For clarity, duplicate labels have been removed, leaving a total of 12,585 labels. The dataset is in CSV format.
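As a rough illustration of working with a CSV of this shape, the sketch below tallies labels per (category, action) pair using only the standard library. The column names `segment`, `category`, and `action`, and the sample rows, are assumptions made for this example. Check the header of the actual CPPS file and adjust accordingly.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, category_col="category", action_col="action"):
    """Count labels per (category, action) pair.

    The default column names are hypothetical; pass the real ones
    from the CPPS header.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        key = (row[category_col].strip().lower(), row[action_col].strip().lower())
        counts[key] += 1
    return counts


# Tiny in-memory sample standing in for the real file (made-up rows).
sample = io.StringIO(
    "segment,category,action\n"
    "We collect your email address.,contact,collect/use\n"
    "We share your location with partners.,location,share\n"
    "We may use your phone number.,contact,collect/use\n"
)

counts = count_labels(sample)
for (category, action), n in sorted(counts.items()):
    print(f"{category:10s} {action:12s} {n}")
```

To run this on the real dataset, open the CSV file with `open(path, newline="", encoding="utf-8")` and pass the resulting file object to `count_labels` together with the actual column names.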
- Python standard library
- re
- langdetect
- numpy
- os
- pandas
- math
- keras
- modAL
- tensorflow
- Python standard library
- csv
- CPPS check
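
Before running the code, it can help to confirm that the third-party packages listed above are installed. The following is a minimal sketch, assuming the import names match the names in the list. Package names on PyPI may differ from import names, and no version requirements are stated here.

```python
import importlib.util

# Import names of the third-party packages listed above (assumed to match).
THIRD_PARTY = ["langdetect", "numpy", "pandas", "keras", "modAL", "tensorflow"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(THIRD_PARTY)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed third-party packages were found.")
```

The standard library modules (`re`, `os`, `math`, `csv`) ship with Python and need no installation.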
Wenjun Qiu, David Lie, and Lisa Austin, “Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with Crowdsourcing and Active Learning”, in Proceedings of the 32nd USENIX Security Symposium, 2023. (To appear.)