# Getting started with AnonymoUUs: Example with test data #

This Jupyter notebook runs through the test data in the AnonymoUUs repository to demonstrate the workings of the AnonymoUUs package. To run the code in this Jupyter notebook yourself, [***insert instructions*** -> I cloned the entire repository but that should not be necessary when you have pip installed anonymouus already?].

**Please note**: AnonymoUUs substitutes strings in data files with replacement strings, for example, names with numbers. Although replacing personal details with non-personal details can make data less identifiable, it does not guarantee fully anonymised data.

To run the code in each cell:

1. Click on the cell to select it.
2. Press SHIFT+ENTER on your keyboard or press the play button in the toolbar above.

### 1. Install and import packages ###

In [1]:
# Install anonymouus so that the Jupyter notebook can access it
import sys
!{sys.executable} -m pip install anonymouus



In [2]:
# Import packages
import csv
from pathlib import Path
import pandas as pd
import re

from anonymouus import Anonymize

### 2. Provide the path to the data to be substituted ###

In [3]:
test_data = Path('../tests/test_data/')

### 3. Provide the mapping ###

In the mapping, you specify the keywords to be replaced in the data file(s) and their substitutes, for example a name-number keyfile. The mapping can be: 
1. a dictionary (e.g., .json file)
2. the path to a csv file 
3. a function 

When using a csv file, make sure:
* your file has two columns 
    * left column: words to be replaced (e.g., "name")
    * right column: substitutions (e.g., "participant number")
* your file has a column header (any format)

For example:

In [4]:
key_csv = test_data/'keys.csv'

In [5]:
# Here's what the csv mapping looks like
df_key = pd.read_csv(key_csv,index_col='names')
df_key

names,subt
Jane Doe,aaaa
Amsterdam,bbbb
j.doe@gmail.com,cccc
r#ca.*?er,dddd
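
If you want to build a key file of your own, a file with the same layout as the one displayed above could be written with Python's standard `csv` module. This is only a sketch: the file name `example_keys.csv` and the temporary folder are placeholders chosen for illustration, and the rows simply mirror the test key file.

```python
import csv
import tempfile
from pathlib import Path

# Rows mirroring the key file displayed above: (keyword, substitute)
rows = [
    ("Jane Doe", "aaaa"),
    ("Amsterdam", "bbbb"),
    ("j.doe@gmail.com", "cccc"),
    ("r#ca.*?er", "dddd"),  # the r# prefix marks a regular expression
]

# Write to a temporary folder so nothing in the repository is overwritten
# (folder and file name are placeholders for this example)
example_key_csv = Path(tempfile.mkdtemp()) / "example_keys.csv"
with open(example_key_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["names", "subt"])  # header row; the names are free to choose
    writer.writerows(rows)

# Read the file back to confirm its contents
with open(example_key_csv, newline="", encoding="utf-8") as f:
    written = list(csv.reader(f))

print(written)
```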


Note that besides strings, you can also add regular expressions as keywords. In this case, the strings corresponding to the specified pattern will be replaced.
* In a csv file: ```r#_my regex_```
* In a dictionary: ```re.compile('_my regex_')```

In [6]:
# Another example of a mapping, now in a dictionary variable
key_dict = {
'Jane Doe': 'aaaa',
'Amsterdam':'bbbb',
'j.doe@gmail.com':'cccc',
re.compile('ca.*?er'):'dddd'
} 
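
To get a feel for what such a mapping does to a piece of text, the sketch below applies a dictionary like the one above using only the standard `re` module. The helper `apply_mapping` is hypothetical and is not how AnonymoUUs is implemented; the sample sentence is also made up for this illustration.

```python
import re

# Same structure as the dictionary mapping above
example_mapping = {
    "Jane Doe": "aaaa",
    "Amsterdam": "bbbb",
    "j.doe@gmail.com": "cccc",
    re.compile("ca.*?er"): "dddd",
}


def apply_mapping(text, mapping):
    """Illustrative helper (not part of AnonymoUUs): replace each keyword in turn."""
    for keyword, substitute in mapping.items():
        if isinstance(keyword, re.Pattern):
            # Compiled regular expression: replace every match of the pattern
            text = keyword.sub(substitute, text)
        else:
            # Plain string: replace every literal occurrence
            text = text.replace(keyword, substitute)
    return text


# Made-up sample text for illustration
sample = "My name is Jane Doe and I live in Amsterdam.\nCasper loves his caterpillar"
result = apply_mapping(sample, example_mapping)
print(result)
```

Note that the pattern `ca.*?er` is case-sensitive here, so "Casper" is left alone while the "cater" in "caterpillar" is replaced.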

### 4. Create an Anonymize object ###

You need this object in order to call the substitute method.

In [7]:
anym = Anonymize(key_csv)

Customize the replacement process by adding options or flags:

* Replace only entire words: ```Anonymize(key_csv,use_word_boundaries=True)```
* Replace case-insensitively: ```Anonymize(key_csv,flags=re.IGNORECASE)```
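
The effect of these two options can be illustrated with plain regular expressions. The sketch below assumes that word boundaries correspond to the `\b` anchor of Python's `re` module and that the flag is passed through to the pattern matching; it does not call AnonymoUUs itself, and the sample sentences are made up.

```python
import re

city_text = "Casper from Amsterdam calls himself an Amsterdammer."
pet_text = "Casper loves his caterpillar"

# Without word boundaries, the keyword is also replaced inside longer words
plain = re.sub("Amsterdam", "bbbb", city_text)

# With word boundaries (\b), only the stand-alone word is replaced
whole_words = re.sub(r"\bAmsterdam\b", "bbbb", city_text)

# Case-sensitive matching leaves "Casper" alone because of the capital C
case_sensitive = re.sub("ca.*?er", "dddd", pet_text)

# With re.IGNORECASE, "Casper" matches the pattern as well
case_insensitive = re.sub("ca.*?er", "dddd", pet_text, flags=re.IGNORECASE)

print(plain)
print(whole_words)
print(case_sensitive)
print(case_insensitive)
```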

### 5. Perform the substitutions ###

The substitute method is the step where the specified words are replaced by their substitutions. It replaces **all** occurrences of the specified words with the substitutions, in all files in the provided source folder.

Provide two paths:
* the path to the original data (source)
* the path for the resulting data (target); this path will be created if it does not exist yet

In [8]:
# Perform the substitutions and put the results in a folder called "pseudonymised"
anym.substitute(test_data,Path.cwd()/'pseudonymised/')

There is now a new folder called "pseudonymised" in the same folder as this Jupyter notebook.

### 6. Check the test data ###

Finally, it is always wise to check whether the words you wanted were substituted correctly. Depending on the type of data you have, you can do this manually or via code.

In [9]:
# Read the newly created .txt file
with open('pseudonymised/test_data/aaaa.txt') as f:
    lines = f.readlines()

lines

['My name is aaaa and I live in bbbb.\n', 'Casper loves his ddddpillar']

In [10]:
# Read the newly created .json file
with open('pseudonymised/test_data/profile.json') as f:
    lines2 = f.readlines()

lines2

['{\n',
 '    "name": {\n',
 '      "givenName": "Jane",\n',
 '      "familyName": "Doe",\n',
 '      "formattedName": "aaaa"\n',
 '    },\n',
 '    "displayName": "aaaa",\n',
 '    "emails": [{\n',
 '      "value": "cccc"\n',
 '    }],\n',
 '    "gender": {\n',
 '      "type": "female"\n',
 '    }\n',
 '  }']
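
For larger datasets, a manual check like the ones above can be complemented by a small script that searches the output folder for keywords that should no longer be there. The function `find_leftovers` below is a hypothetical helper written for this illustration, not part of AnonymoUUs; it only looks for literal strings and is demonstrated on a throwaway folder with made-up files.

```python
import tempfile
from pathlib import Path


def find_leftovers(folder, keywords):
    """Return {relative file path: [keywords still present]} for text files in folder."""
    leftovers = {}
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        try:
            content = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip files that are not UTF-8 text
        found = [kw for kw in keywords if kw in content]
        if found:
            leftovers[str(path.relative_to(folder))] = found
    return leftovers


# Demonstration on a temporary folder with made-up files
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "clean.txt").write_text("My name is aaaa.", encoding="utf-8")
(demo_dir / "leaky.json").write_text(
    '{"givenName": "Jane", "displayName": "aaaa"}', encoding="utf-8"
)

report = find_leftovers(demo_dir, ["Jane Doe", "Jane", "Doe"])
print(report)
```

In the .json output shown above, "Jane" and "Doe" are still present as separate values because the mapping only contains the full string "Jane Doe". A check along these lines, pointed at the "pseudonymised" folder with the individual names as keywords, would be expected to flag that file.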