GDPR Anonymization Tool (GSOC2018)
This tool is built to aid companies and organizations with their text anonymization needs, and was developed in the context of CLiPS' Google Summer of Code 2018.
What is text anonymization?
Text anonymization refers to the processing of text, stripping it of any identifying attributes, thus hiding sensitive details and protecting the identity of users.
This system consists of two main components.
Sensitive Attribute Detection System
Before text can be anonymized, the sensitive attributes in the text which give out information have to be identified. We use two methods for this (which can be used in tandem or as standalone systems):
- Named Entity Recognition Based Detection: This relies on tagging sensitive entities in text. The user can set up different configurations for different entities, which determine how a given entity is anonymized. The available options are Deletion/Replacement, Suppression, and Generalization. The system currently ships with spaCy's NER system, but it can very easily be swapped out for other NER models.
- TF-IDF Based Rare Entity Detection: Certain sensitive attributes in text might not necessarily be tagged/identified by the NER system. These sensitive tokens can be identified by the TF-IDF system: term frequency–inverse document frequency identifies possible rare entities in text based on the distribution and occurrence of tokens across sample text snippets supplied by the user. Once the TF-IDF score threshold is set, tokens with scores above it are determined to be sensitive and anonymized.
Sensitive Attribute Anonymization System
Once the sensitive attributes/tokens are detected, they need to be anonymized depending on the kind of token they are. The user can set different anonymization actions for different tokens. The currently available options are:
- Deletion/Replacement: To be used in cases where retaining even a part of the attribute through the other anonymization methods is not appropriate. Completely replaces the attribute with a preset replacement. `My name is John Doe` would be replaced by `My name is <Name>`.
- Suppression: To be used when hiding a part of the information is enough to protect the user's anonymity. The user can supply the percentage or the number of characters they want suppressed. `My phone number is 9876543210` would be replaced by `My phone number is 98765*****` if the user chooses 50% suppression.
- Generalization: To be used when the entity is sensitive enough to need anonymization but can still be partially retained to provide information. The system has two methods of carrying out generalization:
- Word Vector Based: In this option, the nearest neighbor of the word in the vector space is utilized to generalize the attribute. `I live in India` gets generalized to `I live in Pakistan`. This method, while completely changing the word, largely retains vector-space information useful in most NLP and text processing tasks.
- Part Holonym Based: In this option, the system parses the WordNet lexical database to extract part holonyms. This method works exceptionally well with geographical entities, and the user can choose the level of generalization. `I live in Beijing` gets generalized to `I live in China` at level 1 generalization and to `I live in Asia` at level 2.
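The Deletion/Replacement and Suppression actions above can be sketched as follows (the helper names are hypothetical, not the tool's actual API, and the suppression here always masks the trailing characters):

```python
def replace_token(token: str, placeholder: str) -> str:
    """Deletion/Replacement: swap the whole attribute for a preset placeholder."""
    return placeholder

def suppress_token(token: str, fraction: float = 0.5) -> str:
    """Suppression: mask the trailing `fraction` of the token's characters with '*'."""
    keep = len(token) - round(len(token) * fraction)
    return token[:keep] + "*" * (len(token) - keep)

# "John Doe" -> "<Name>"
print(replace_token("John Doe", "<Name>"))
# "9876543210" -> "98765*****" at 50% suppression
print(suppress_token("9876543210", 0.5))
```

Generalization is harder to sketch self-containedly, since it needs a word-vector model or the WordNet database at runtime.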
Installation / Setup
By using a Setup Script
You may directly use a setup script to complete all the steps. Currently tested on an Ubuntu 16.04, Python 3.6 setup. Not recommended because of the very large downloads, which may fail. Find the setup script here.
By following instructions (recommended)
- Pull the code from GitHub using `git clone https://github.com/clips/gsoc2018.git`
- Traverse into the `GDPR` directory.
- Install the requirements using `pip3 install -r requirements.txt`.
- You also need the spaCy model. Download and install it using `python3 -m spacy download en_core_web_sm`. This should not take more than a few minutes (29 MB of download).
- Download the NLTK corpus using `python3 -c "import nltk; nltk.download('wordnet')"`
- Download Plasticity AI's "magnitude" format of word vectors by clicking here. (To be used in word-vector based generalization.)
- Paste the file into the privacy directory. (You may also do this in one step by running `cd privacy` followed by the download command there.)
- Install the magnitude package using `SKIP_MAGNITUDE_WHEEL=1 pip3 install pymagnitude==0.1.46` (not in the requirements file because of the extra flag).
- In case you haven't already, go into the privacy directory with `cd privacy` and run the following commands: `python3 manage.py makemigrations` and `python3 manage.py migrate`.
- Your application is now ready to use. Run `python3 manage.py runserver` to start it up and navigate in your browser to localhost:8000 to view the webapp.
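The steps above can be consolidated into a single shell session. The `cd gsoc2018/GDPR` path is an assumption about the repository layout; the word-vector download (a manual click in the browser) is left as a comment:

```shell
# Clone the project and enter the GDPR directory (path assumed)
git clone https://github.com/clips/gsoc2018.git
cd gsoc2018/GDPR

# Python dependencies, spaCy model, and NLTK WordNet corpus
pip3 install -r requirements.txt
python3 -m spacy download en_core_web_sm
python3 -c "import nltk; nltk.download('wordnet')"

# Magnitude package, installed separately because of the extra flag
SKIP_MAGNITUDE_WHEEL=1 pip3 install pymagnitude==0.1.46

# Manually download the .magnitude word-vector file and place it in privacy/, then:
cd privacy
python3 manage.py makemigrations
python3 manage.py migrate
python3 manage.py runserver
```

These are the same commands listed step by step above, gathered in order for convenience.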
Additional steps (optional)
- An admin panel is available at localhost:8000/admin after adding a superuser with `python3 manage.py createsuperuser`.
- If you are deploying the app on a server that needs to be reached externally, add your host name to `ALLOWED_HOSTS` in the settings file and run `python3 manage.py runserver 0.0.0.0:8000` (or whatever port you want to run it on).
- To enable WordVector based generalization, you need to perform some additional steps:
- Download the GloVe magnitude file
- Make sure the path in `views.py` points to the file's location
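For the externally reachable deployment mentioned above, the `ALLOWED_HOSTS` entry in the project's Django settings might look like this (`example.com` is a placeholder for your own host name):

```python
# settings.py (fragment) -- "example.com" is a placeholder host name
ALLOWED_HOSTS = ["localhost", "127.0.0.1", "example.com"]
```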