<a href="https://colab.research.google.com/github/apmoore1/target-extraction/blob/master/tutorials/Anonymise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
%%capture
!pip install -U git+https://github.com/apmoore1/target-extraction.git@master#egg=target-extraction

# How to use the anonymise functionality within TargetTextCollection, and why

The "why": some of the Aspect/Target Based Sentiment Analysis (ABSA) datasets come from sources that require you to sign a license. These datasets are free to use, but the license does not allow you to redistribute the data, as is the case for the [SemEval 2014 task 4 datasets](http://metashare.ilsp.gr:8080/repository/browse/semeval-2014-absa-train-data-v20-annotation-guidelines/683b709298b811e3a0e2842b2b6a04d7c7a19307f18a4940beef6a6143f937f0/). Therefore, to allow everyone to share results, the anonymise functionality was added to the TargetTextCollection.

To explain the functionality we are going to use the [Election Twitter dataset](https://figshare.com/articles/EACL_2017_-_Multi-target_UK_election_Twitter_sentiment_corpus/4479563/1), specifically the test set as it is the smallest split:

In [2]:
from target_extraction.dataset_parsers import wang_2017_election_twitter_test
test_dataset = wang_2017_election_twitter_test()
next(test_dataset.dict_iterator())

{'categories': None,
 'category_sentiments': None,
 'spans': [Span(start=25, end=33), Span(start=73, end=83)],
 'target_sentiments': ['neutral', 'neutral'],
 'targets': ['economic', 'Budget2015'],
 'text': "Don't you kinda wish all economic news was delivered like this? #GE2015 #Budget2015 http://t.co/4fssrNqnyj",
 'text_id': '78336255642988544'}

Above we can see one example from the dataset: a single text that contains two targets.

In this notebook we are going to show how to:
1. Anonymise a dataset
2. De-anonymise a dataset

## Anonymise a dataset

Before anonymising the dataset we are going to add some metadata to explain what the dataset is. As we can see below, the dataset currently has no metadata, only an empty name field:

In [4]:
test_dataset.metadata

{'name': ''}

Let's add a name and the dataset split to the metadata:

In [5]:
test_dataset.name = 'Election'
test_dataset.metadata['split'] = 'Test'
print(f'Name attribute {test_dataset.name}')
print(f'Metadata: {test_dataset.metadata}')

Name attribute Election
Metadata: {'name': 'Election', 'split': 'Test'}


As we can see, the `metadata` stores all of the meta information about the dataset, including the name attribute.

Now that we know some more about the TargetTextCollection we can move on to anonymising the dataset like so:

In [6]:
test_dataset.anonymised = True
next(test_dataset.dict_iterator())

{'categories': None,
 'category_sentiments': None,
 'spans': [Span(start=25, end=33), Span(start=73, end=83)],
 'target_sentiments': ['neutral', 'neutral'],
 'targets': ['economic', 'Budget2015'],
 'text_id': '78336255642988544'}

We can see that the dataset has been anonymised: the same sample as before no longer has any text. However, all of the other parts of the data are still there, allowing you to perform some analysis on the dataset, e.g. counting the number of targets.
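
To make this concrete, here is a minimal sketch of the kind of analysis that needs no text. It uses plain Python dicts in place of the library's objects (an assumption made to keep the sketch self-contained); the first sample mirrors the output above and the second is invented purely for illustration. `target_statistics` is a hypothetical helper, not part of `target_extraction`.

```python
from collections import Counter

# Plain dicts standing in for anonymised samples (assumption for this sketch).
# The first mirrors the example output above; the second is made up.
anonymised_samples = [
    {'text_id': '78336255642988544',
     'targets': ['economic', 'Budget2015'],
     'spans': [(25, 33), (73, 83)],
     'target_sentiments': ['neutral', 'neutral']},
    {'text_id': 'made-up-id',
     'targets': ['NHS'],
     'spans': [(0, 3)],
     'target_sentiments': ['positive']},
]


def target_statistics(samples):
    """Return the total number of targets and a count of each sentiment label."""
    number_targets = sum(len(sample['targets']) for sample in samples)
    sentiment_counts = Counter(
        sentiment
        for sample in samples
        for sentiment in sample['target_sentiments']
    )
    return number_targets, sentiment_counts


number_targets, sentiment_counts = target_statistics(anonymised_samples)
print(f'Number of targets: {number_targets}')
print(f'Sentiment counts: {dict(sentiment_counts)}')
```

None of these statistics touch the `text` field, which is why they survive anonymisation.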

Furthermore, the dataset's `anonymised` attribute is now `True`, and this has also been added to the `metadata`:

In [7]:
print(f'Anonymised {test_dataset.anonymised}')
print(f'Metadata: {test_dataset.metadata}')

Anonymised True
Metadata: {'name': 'Election', 'split': 'Test', 'anonymised': True}


### Exporting and Importing
Once the dataset has been anonymised you may want to export it so that it can be shared with others. This is useful if you want to share results that are stored within the TargetTextCollection.

From this anonymised state we can export to a JSON string and back:

In [16]:
from target_extraction.data_types import TargetTextCollection
export_json = test_dataset.to_json()
print(f'Exported JSON string: {export_json[:100]}')
json_loaded_dataset = TargetTextCollection.from_json(export_json)
print(f'Example from the loaded JSON:\n{next(json_loaded_dataset.dict_iterator())}')
print(f'Metadata from the loaded JSON: {json_loaded_dataset.metadata}')

Exported JSON string: {"text_id": "78336255642988544", "targets": ["economic", "Budget2015"], "spans": [[25, 33], [73, 83]
Example from the loaded JSON:
{'text_id': '78336255642988544', 'targets': ['economic', 'Budget2015'], 'spans': [Span(start=25, end=33), Span(start=73, end=83)], 'target_sentiments': ['neutral', 'neutral'], 'categories': None, 'category_sentiments': None}
Metadata from the loaded JSON: {'name': 'Election', 'split': 'Test', 'anonymised': True}
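
Before sharing an export it can be worth checking that no raw text slipped through. The sketch below is one way to do that with only the standard library. It assumes the exported string holds one JSON object per line, which is an assumption about the export format rather than something confirmed here, and the sample strings are hand-written to resemble the truncated output above. `lines_with_text` is a hypothetical helper, not a library function.

```python
import json


def lines_with_text(export_string):
    """Return the 1-based numbers of lines whose JSON object has a 'text' key."""
    leaking = []
    for number, line in enumerate(export_string.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if isinstance(record, dict) and 'text' in record:
            leaking.append(number)
    return leaking


# Hand-written stand-ins for exported data (assumed one JSON object per line).
safe_export = json.dumps({'text_id': '78336255642988544',
                          'targets': ['economic', 'Budget2015'],
                          'spans': [[25, 33], [73, 83]]})
unsafe_export = safe_export + '\n' + json.dumps({'text_id': 'made-up-id',
                                                 'text': 'some raw text'})

print(lines_with_text(safe_export))
print(lines_with_text(unsafe_export))
```

An empty list means no line carried a `text` field; any line numbers returned point at records that should not be shared.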


We can also export to a JSON file and back:

In [19]:
import tempfile
from pathlib import Path
with tempfile.NamedTemporaryFile(mode='w+') as temp_file:
  # File path to save the data to
  temp_fp = Path(temp_file.name)
  test_dataset.to_json_file(temp_fp, include_metadata=True)
  json_loaded_dataset = TargetTextCollection.load_json(temp_fp)
print(f'Example from the loaded JSON:\n{next(json_loaded_dataset.dict_iterator())}')
print(f'Metadata from the loaded JSON: {json_loaded_dataset.metadata}')

Example from the loaded JSON:
{'text_id': '78336255642988544', 'targets': ['economic', 'Budget2015'], 'spans': [Span(start=25, end=33), Span(start=73, end=83)], 'target_sentiments': ['neutral', 'neutral'], 'categories': None, 'category_sentiments': None}
Metadata from the loaded JSON: {'name': 'Election', 'split': 'Test', 'anonymised': True}


When loading these datasets, whether from a JSON string or a file, the `metadata`, `name`, and `anonymised` values can be overridden:

In [23]:
export_json = test_dataset.to_json()
new_name = 'Twitter Election'
new_metadata = {'language': 'English', 'split': 'Validation'}
json_loaded_dataset = TargetTextCollection.from_json(export_json, name=new_name, 
                                                     metadata=new_metadata)
print(f'Name from loaded JSON: {json_loaded_dataset.name}')
print(f'Metadata from loaded JSON: {json_loaded_dataset.metadata}')

print('\nOriginal dataset still has the original Name and Metadata')
print(f'Name: {test_dataset.name}')
print(f'Metadata: {test_dataset.metadata}')

Name from loaded JSON: Twitter Election
Metadata from loaded JSON: {'language': 'English', 'split': 'Validation', 'anonymised': True, 'name': 'Twitter Election'}

Original dataset still has the original Name and Metadata
Name: Election
Metadata: {'name': 'Election', 'split': 'Test', 'anonymised': True}


If you override the `anonymised` value with `True` when loading, the loaded data will also be anonymised:

In [25]:
not_anonymised_election = wang_2017_election_twitter_test()
print(f'Is anonymised {not_anonymised_election.anonymised}')
data_example = next(not_anonymised_election.dict_iterator())
print(f'Data example from original {data_example}')

print('\nExport the data')
election_json = not_anonymised_election.to_json()
print('Anonymise the data when we load the data')
anonymised_election = TargetTextCollection.from_json(election_json, anonymised=True)
print(f'Is anonymised {anonymised_election.anonymised}')
data_example = next(anonymised_election.dict_iterator())
print(f'Data example from anonymised {data_example}')

Is anonymised False
Data example from original {'text': "Don't you kinda wish all economic news was delivered like this? #GE2015 #Budget2015 http://t.co/4fssrNqnyj", 'text_id': '78336255642988544', 'targets': ['economic', 'Budget2015'], 'spans': [Span(start=25, end=33), Span(start=73, end=83)], 'target_sentiments': ['neutral', 'neutral'], 'categories': None, 'category_sentiments': None}

Export the data
Anonymise the data when we load the data
Is anonymised True
Data example from anonymised {'text_id': '78336255642988544', 'targets': ['economic', 'Budget2015'], 'spans': [Span(start=25, end=33), Span(start=73, end=83)], 'target_sentiments': ['neutral', 'neutral'], 'categories': None, 'category_sentiments': None}


## De-anonymise a dataset

If you have someone else's anonymised data but want to get the `text` back to perform further analysis, the `de_anonymise` function is required.

The `de_anonymise` function assumes that you have the `text` mapped to a unique key that matches the unique key within the anonymised data.

For example, the unique key for each TargetText/sample in a TargetTextCollection is defined by its `text_id`:

In [29]:
test_dataset_keys = list(test_dataset.keys())[:5]
print('Example unique keys from the anonymised Election dataset\n'
      f'{test_dataset_keys}')
example_sample = test_dataset[test_dataset_keys[0]]
print('Example TargetText/sample from the Election dataset\n'
      f'{example_sample}')

Example unique keys from the anonymised Election dataset
['78336255642988544', '81213828119175169', '81236070488113152', '78191643813232640', '65025637522354178']
Example TargetText/sample from the Election dataset
TargetText({'text_id': '78336255642988544', 'targets': ['economic', 'Budget2015'], 'spans': [Span(start=25, end=33), Span(start=73, end=83)], 'target_sentiments': ['neutral', 'neutral'], 'categories': None, 'category_sentiments': None})


As we can see from above, the TargetTextCollection keys are the unique `text_id` values of all of the samples/TargetTexts.

Therefore to de-anonymise this collection we need a dictionary of:
``` python
{'text_id': 'text'}
```
E.g.:
``` python
{'78336255642988544': "Don't you kinda wish all economic news was delivered like this? #GE2015 #Budget2015 http://t.co/4fssrNqnyj"}
```
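
The idea of re-attaching text by key can be sketched without the library. The example below uses plain dicts in place of TargetText objects (an assumption made so it runs on its own) and a hypothetical helper, `restore_text`. The span check is an extra safeguard added in this sketch; it is not a claim about what `de_anonymise` does internally.

```python
# Plain dicts standing in for the library's samples (assumption for this sketch).
original_samples = [
    {'text_id': '78336255642988544',
     'text': "Don't you kinda wish all economic news was delivered like this? "
             "#GE2015 #Budget2015 http://t.co/4fssrNqnyj"},
]
anonymised_sample = {
    'text_id': '78336255642988544',
    'targets': ['economic', 'Budget2015'],
    'spans': [(25, 33), (73, 83)],
}

# Map each unique key to its text.
id_to_text = {sample['text_id']: sample['text'] for sample in original_samples}


def restore_text(sample, id_to_text):
    """Return a copy of `sample` with its text restored.

    Raises ValueError if a span no longer lines up with its target, which
    would suggest the text belongs to a different version of the data.
    """
    text = id_to_text[sample['text_id']]
    for target, (start, end) in zip(sample['targets'], sample['spans']):
        if text[start:end] != target:
            raise ValueError(f'{target!r} not found at {start}:{end}')
    return {**sample, 'text': text}


restored = restore_text(anonymised_sample, id_to_text)
print(restored['text'])
```

Returning a copy leaves the anonymised sample untouched, so the shareable version is not modified by accident.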

We can easily do this for known datasets if we have both the anonymised and the non-anonymised versions, like so:

In [33]:
anonymised_test_dataset = wang_2017_election_twitter_test()
anonymised_test_dataset.anonymised = True
assert anonymised_test_dataset.anonymised
non_anonymised_test_dataset = wang_2017_election_twitter_test()
assert not non_anonymised_test_dataset.anonymised

anonymised_test_dataset.de_anonymise(non_anonymised_test_dataset.dict_iterator())
assert not anonymised_test_dataset.anonymised
example_sample = anonymised_test_dataset['78336255642988544']
for key, value in example_sample.items():
  print(f'{key} : {value}')

text_id : 78336255642988544
targets : ['economic', 'Budget2015']
spans : [Span(start=25, end=33), Span(start=73, end=83)]
target_sentiments : ['neutral', 'neutral']
categories : None
category_sentiments : None
text : Don't you kinda wish all economic news was delivered like this? #GE2015 #Budget2015 http://t.co/4fssrNqnyj


As we can see from above the anonymised dataset has now been de-anonymised by using an original non-anonymised version of the dataset.