<a href="https://colab.research.google.com/github/apmoore1/target-extraction/blob/master/tutorials/Target_IDs_and_re_ordering_targets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
%%capture
!pip install -U git+https://github.com/apmoore1/target-extraction.git@master#egg=target-extraction

# Target IDs, re-ordering targets, and combining data on IDs

In this tutorial we will show you how to:
1. Create unique IDs at the target level. The methods used here generalise to other attributes, such as aspects.
2. Re-order a dataset so that the target attributes are ordered by where they occur in the text. This is useful for methods that assume the targets are ordered, e.g. the inter aspect/target neural network of [Hazarika et al. 2018](https://www.aclweb.org/anthology/N18-2043.pdf).
3. Combine data from one dataset into another based on the target IDs from point 1.

The reason you might want to create and save unique target or aspect IDs is so that your work can be shared with others, knowing that each prediction on a target or aspect has a unique identifier.

To demonstrate all of the above we are going to use the training split from the Election Twitter dataset from [Wang et al. 2017](https://www.aclweb.org/anthology/E17-1046.pdf) as this can easily be downloaded as shown below:

In [0]:
from target_extraction.dataset_parsers import wang_2017_election_twitter_train

dataset = wang_2017_election_twitter_train()

## Create Unique IDs
We now have a dataset, and below we print one target text object, which contains three targets (`Police`, `crime`, and `Conservatives`). However, the only unique identifier in this target object is the `text_id` key, which uniquely identifies the target object as a whole but none of its targets (or categories, if they existed in the dataset).

In [3]:
target_object = next(dataset.dict_iterator())
target_object

{'categories': None,
 'category_sentiments': None,
 'spans': [Span(start=22, end=28),
  Span(start=49, end=54),
  Span(start=0, end=13)],
 'target_sentiments': ['neutral', 'positive', 'neutral'],
 'targets': ['Police', 'crime', 'Conservatives'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072'}

Therefore we can add unique identifiers for the targets using the `add_unique_key` function, which takes two main arguments:
1. The key whose values you want to uniquely identify; in our case this is the `targets` key.
2. The name of the key that will store the unique identifiers, e.g. `target_ids`.

In [4]:
dataset.add_unique_key('targets', 'target_ids')
target_object = next(dataset.dict_iterator())
target_object

{'categories': None,
 'category_sentiments': None,
 'spans': [Span(start=22, end=28),
  Span(start=49, end=54),
  Span(start=0, end=13)],
 'target_ids': ['81207500663427072::0',
  '81207500663427072::1',
  '81207500663427072::2'],
 'target_sentiments': ['neutral', 'positive', 'neutral'],
 'targets': ['Police', 'crime', 'Conservatives'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072'}

We can now see that we have a new field, `target_ids`, which holds unique identifiers for the values in the `targets` key. Furthermore, we can see that each unique ID is generated by combining the `text_id` value with a delimiter (by default this is `::`) and then the index of the value we are creating an ID for.
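
To make the ID format concrete, here is a minimal sketch of the scheme described above. It is not the library's implementation; `make_unique_ids` is a hypothetical helper written only for illustration.

```python
from typing import List, Sequence


def make_unique_ids(text_id: str, values: Sequence[str],
                    id_delimiter: str = '::') -> List[str]:
  # One ID per value: the text's ID, the delimiter, then the value's index.
  return [f'{text_id}{id_delimiter}{index}' for index in range(len(values))]


targets = ['Police', 'crime', 'Conservatives']
print(make_unique_ids('81207500663427072', targets))
# ['81207500663427072::0', '81207500663427072::1', '81207500663427072::2']
```

Because the ID only depends on the `text_id` and the original index, it stays attached to the same target even if the lists are later permuted.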

If you want to use a different `delimiter` this can be easily done as shown below:

In [5]:
# need to delete the current target_ids field before adding it again:
for value in dataset.values():
  del value['target_ids']
dataset.add_unique_key('targets', 'target_ids', id_delimiter='$$')
target_object = next(dataset.dict_iterator())
target_object

{'categories': None,
 'category_sentiments': None,
 'spans': [Span(start=22, end=28),
  Span(start=49, end=54),
  Span(start=0, end=13)],
 'target_ids': ['81207500663427072$$0',
  '81207500663427072$$1',
  '81207500663427072$$2'],
 'target_sentiments': ['neutral', 'positive', 'neutral'],
 'targets': ['Police', 'crime', 'Conservatives'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072'}

## Re-ordering

As we can see from the example above, the targets are not ordered left to right as they appear in the text (`Conservatives` is the last target in the `targets` list but the first target to appear in the text).

Some methods assume that the data they are fed is in this left-to-right order, e.g. the [Hazarika et al. 2018](https://www.aclweb.org/anthology/N18-2043.pdf) method, which uses an LSTM to encode the inter aspect/target representations.

Thus sometimes we need to re-order the data:

In [6]:
dataset.re_order()
target_object = next(dataset.dict_iterator())
target_object

{'categories': None,
 'category_sentiments': None,
 'spans': [Span(start=0, end=13),
  Span(start=22, end=28),
  Span(start=49, end=54)],
 'target_ids': ['81207500663427072$$2',
  '81207500663427072$$0',
  '81207500663427072$$1'],
 'target_sentiments': ['neutral', 'neutral', 'positive'],
 'targets': ['Conservatives', 'Police', 'crime'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072'}

As we can see the `targets` are now re-ordered along with all the other `List` keys like `target_sentiments`, `spans`, and `target_ids`. This is because when `re_order` is called it re-orders all `List` keys according to the `spans` key.
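
The idea of re-ordering every list by the `spans` key can be sketched in plain Python as below. This is an illustration under assumptions, not the library's code: the `Span` tuple is a local stand-in for the library's own `Span` type, and `order_by_spans` is a hypothetical helper.

```python
from typing import Any, Dict, List, NamedTuple


class Span(NamedTuple):
  # Local stand-in for the library's Span type.
  start: int
  end: int


def order_by_spans(record: Dict[str, Any], keys: List[str]) -> Dict[str, Any]:
  # Indices of the spans, sorted by where each span starts in the text.
  order = sorted(range(len(record['spans'])),
                 key=lambda index: record['spans'][index].start)
  reordered = dict(record)
  for key in keys:
    # Apply the same permutation to every list so they stay aligned.
    reordered[key] = [record[key][index] for index in order]
  return reordered


record = {'spans': [Span(22, 28), Span(49, 54), Span(0, 13)],
          'targets': ['Police', 'crime', 'Conservatives'],
          'target_sentiments': ['neutral', 'positive', 'neutral']}
ordered = order_by_spans(record, ['spans', 'targets', 'target_sentiments'])
print(ordered['targets'])  # ['Conservatives', 'Police', 'crime']
```

The key point is that one permutation is computed from the spans and then applied to every aligned list, which is why the sentiments and IDs follow their targets.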

If you have some `List` fields that should not be re-ordered, this can be explicitly taken into account through the `re_order` function. An example is shown below:

In [7]:
dataset.tokenize(str.split)
target_object = next(dataset.dict_iterator())
target_object

{'categories': None,
 'category_sentiments': None,
 'spans': [Span(start=0, end=13),
  Span(start=22, end=28),
  Span(start=49, end=54)],
 'target_ids': ['81207500663427072$$2',
  '81207500663427072$$0',
  '81207500663427072$$1'],
 'target_sentiments': ['neutral', 'neutral', 'positive'],
 'targets': ['Conservatives', 'Police', 'crime'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072',
 'tokenized_text': ['Conservatives',
  'cut',
  'the',
  'Police',
  'budget',
  'and',
  'this',
  'cut',
  'crime!',
  'Maybe',
  'not...',
  '#spin',
  '#BattleForNumber10']}

We now have a `List` key, `tokenized_text`, that is not associated with the `spans` key at all. Therefore, when re-ordering this dataset we do not want to re-order the `tokenized_text` field, as this would result in the following error:

In [8]:
dataset.re_order()
target_object = next(dataset.dict_iterator())
target_object

Exception: ignored

In [9]:
target_object = next(dataset.dict_iterator())
target_object

{'categories': None,
 'category_sentiments': None,
 'spans': [Span(start=0, end=13),
  Span(start=22, end=28),
  Span(start=49, end=54)],
 'target_ids': ['81207500663427072$$2',
  '81207500663427072$$0',
  '81207500663427072$$1'],
 'target_sentiments': ['neutral', 'neutral', 'positive'],
 'targets': ['Conservatives', 'Police', 'crime'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072',
 'tokenized_text': ['Conservatives',
  'cut',
  'the',
  'Police',
  'budget',
  'and',
  'this',
  'cut',
  'crime!',
  'Maybe',
  'not...',
  '#spin',
  '#BattleForNumber10']}

However, as we can see above, even though the error has occurred the dataset has not changed.

To avoid such errors use the `keys_not_to_order` argument within the `re_order` function like below:

In [10]:
# Want to make sure it is a new un-ordered dataset
dataset = wang_2017_election_twitter_train()
dataset.tokenize(str.split)
dataset.re_order(['tokenized_text'])
target_object = next(dataset.dict_iterator())
target_object

{'categories': None,
 'category_sentiments': None,
 'spans': [Span(start=0, end=13),
  Span(start=22, end=28),
  Span(start=49, end=54)],
 'target_sentiments': ['neutral', 'neutral', 'positive'],
 'targets': ['Conservatives', 'Police', 'crime'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072',
 'tokenized_text': ['Conservatives',
  'cut',
  'the',
  'Police',
  'budget',
  'and',
  'this',
  'cut',
  'crime!',
  'Maybe',
  'not...',
  '#spin',
  '#BattleForNumber10']}

As we can see above, there is no error and the targets have been re-ordered in the expected manner.
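
The following sketch shows one way that excluding keys and leaving the data untouched on failure could fit together. It assumes the error comes from a length mismatch between a list and the spans, which is a guess about the cause rather than something taken from the library's source; `re_order_sketch` is a hypothetical function.

```python
from typing import Any, Dict, List, Optional


def re_order_sketch(record: Dict[str, Any], order: List[int],
                    keys_not_to_order: Optional[List[str]] = None) -> None:
  skip = set(keys_not_to_order or [])
  list_keys = [key for key, value in record.items()
               if isinstance(value, list) and key not in skip]
  # Validate every key first so that a failure leaves the record untouched.
  for key in list_keys:
    if len(record[key]) != len(order):
      raise ValueError(f'Cannot re-order `{key}`: it has {len(record[key])} '
                       f'items but there are {len(order)} spans')
  for key in list_keys:
    record[key] = [record[key][index] for index in order]


record = {'targets': ['Police', 'crime', 'Conservatives'],
          'tokenized_text': ['Conservatives', 'cut', 'the', 'Police', 'budget']}
order = [2, 0, 1]  # left-to-right span order for this example
try:
  re_order_sketch(record, order)
except ValueError as error:
  print(error)  # the record is unchanged at this point
re_order_sketch(record, order, keys_not_to_order=['tokenized_text'])
print(record['targets'])  # ['Conservatives', 'Police', 'crime']
```

Note that a length check alone would not catch an unrelated list that happens to have as many items as there are spans, which is a good reason to name such keys explicitly.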

## Combining data from two datasets using ID

Now that we have these unique target IDs we can use them to combine data from two dataset instances. This can be useful when `dataset_1` is re-ordered and `dataset_2` is not, and both instances have predictions made by different models, e.g. `model_1` and `model_2`. Below we create this setup, where `dataset_1` has predictions from `model_1` (all positive unless the target is `Conservatives`, in which case negative) and `dataset_2` has predictions from `model_2` (all negative unless the target is `Conservatives`, in which case positive):

In [12]:
from target_extraction.data_types import TargetTextCollection
from typing import Optional, List
def add_fictional_predictions(dataset: TargetTextCollection, model_name: str, 
                              positive: bool, 
                              number_runs: Optional[int] = None) -> None:
  def get_predictions(positive_default: bool, targets: List[str]) -> List[str]:
    target_predictions = [] 
    for target in targets:
      if positive_default and target != 'Conservatives':
        target_predictions.append('positive')
      elif positive_default and target == 'Conservatives':
        target_predictions.append('negative')
      elif not positive_default and target != 'Conservatives':
        target_predictions.append('negative')
      elif not positive_default and target == 'Conservatives':
        target_predictions.append('positive')
      else:
        err = 'Fallen through all of the if statements in `get_prediction`'
        raise ValueError(err)
    return target_predictions

  for target_text in dataset.values():
    targets = target_text['targets']
    if number_runs is not None:
      # One list of predictions per run: List[List[str]]
      prediction_list = [get_predictions(positive, targets)
                         for _ in range(number_runs)]
    else:
      # A single list of predictions: List[str]
      prediction_list = get_predictions(positive, targets)
    target_text[f'{model_name}_predictions'] = prediction_list

dataset_1 = wang_2017_election_twitter_train()
dataset_2 = wang_2017_election_twitter_train()
# Add the unique target ids
for dataset in [dataset_1, dataset_2]:
  dataset.add_unique_key('targets', 'target_id')
# re-order dataset_1
dataset_1.re_order()
# add predictions to datasets
add_fictional_predictions(dataset_1, 'model_1', True)
add_fictional_predictions(dataset_2, 'model_2', False)

next(dataset_1.dict_iterator())

{'categories': None,
 'category_sentiments': None,
 'model_1_predictions': ['negative', 'positive', 'positive'],
 'spans': [Span(start=0, end=13),
  Span(start=22, end=28),
  Span(start=49, end=54)],
 'target_id': ['81207500663427072::2',
  '81207500663427072::0',
  '81207500663427072::1'],
 'target_sentiments': ['neutral', 'neutral', 'positive'],
 'targets': ['Conservatives', 'Police', 'crime'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072'}

A sample from `dataset_1` is shown above and one from `dataset_2` below.

In [13]:
next(dataset_2.dict_iterator())

{'categories': None,
 'category_sentiments': None,
 'model_2_predictions': ['negative', 'negative', 'positive'],
 'spans': [Span(start=22, end=28),
  Span(start=49, end=54),
  Span(start=0, end=13)],
 'target_id': ['81207500663427072::0',
  '81207500663427072::1',
  '81207500663427072::2'],
 'target_sentiments': ['neutral', 'positive', 'neutral'],
 'targets': ['Police', 'crime', 'Conservatives'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072'}

Let's say that we now want to add the predictions from `model_1` in `dataset_1` to `dataset_2`. We can do this as follows:

In [14]:
dataset_2.combine_data_on_id(dataset_1, 'target_id', ['model_1_predictions'])
next(dataset_2.dict_iterator())

{'categories': None,
 'category_sentiments': None,
 'model_1_predictions': ['positive', 'positive', 'negative'],
 'model_2_predictions': ['negative', 'negative', 'positive'],
 'spans': [Span(start=22, end=28),
  Span(start=49, end=54),
  Span(start=0, end=13)],
 'target_id': ['81207500663427072::0',
  '81207500663427072::1',
  '81207500663427072::2'],
 'target_sentiments': ['neutral', 'positive', 'neutral'],
 'targets': ['Police', 'crime', 'Conservatives'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072'}

As we can see above, `dataset_2` now has the predictions from `model_1` from `dataset_1`, and the predictions are aligned correctly according to the `target_id` key.

The `combine_data_on_id` function mainly takes 3 arguments:
1. The dataset/`TargetTextCollection` that contains the data to copy; in our case `dataset_1`.
2. The key that both datasets can be aligned on; in our case `target_id`.
3. The names of the keys that contain the data you want from that dataset (`dataset_1`); in our case `model_1_predictions`.
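
The alignment step can be sketched for a single pair of records as follows. This is a simplified illustration, not the library's implementation; `combine_on_id_sketch` is a hypothetical helper and the short `t::N` IDs are made up for brevity.

```python
from typing import Any, Dict, List


def combine_on_id_sketch(target: Dict[str, Any], source: Dict[str, Any],
                         id_key: str, data_keys: List[str]) -> None:
  # Position of every ID in the source record.
  source_index = {id_: index for index, id_ in enumerate(source[id_key])}
  for data_key in data_keys:
    # Read the source values in the order the IDs appear in the target.
    target[data_key] = [source[data_key][source_index[id_]]
                        for id_ in target[id_key]]


re_ordered = {'target_id': ['t::2', 't::0', 't::1'],
              'model_1_predictions': ['negative', 'positive', 'positive']}
original = {'target_id': ['t::0', 't::1', 't::2']}
combine_on_id_sketch(original, re_ordered, 'target_id', ['model_1_predictions'])
print(original['model_1_predictions'])  # ['positive', 'positive', 'negative']
```

Looking values up by ID rather than by position is what lets a re-ordered dataset and an un-ordered dataset exchange data safely.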

Sometimes the number of keys you want from another dataset is quite large. In that case you can use the following function to find the key difference between two datasets:

In [15]:
dataset_1.key_difference(dataset_2)

['model_2_predictions']

In [16]:
dataset_2.key_difference(dataset_1)

[]

There is no key difference between `dataset_2` and `dataset_1` because we have just added that key's data to `dataset_2`.

If you run the `combine_data_on_id` function again you will get the following error:
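
Judging by the two outputs above, the call returns the keys that the other dataset has and the calling dataset lacks. A minimal sketch of that behaviour over plain lists of dictionaries is shown below; `key_difference_sketch` is a hypothetical helper and the direction of the difference is inferred from the outputs, not from the library's documentation.

```python
from typing import Any, Dict, List


def key_difference_sketch(this: List[Dict[str, Any]],
                          other: List[Dict[str, Any]]) -> List[str]:
  this_keys = {key for record in this for key in record}
  other_keys = {key for record in other for key in record}
  # Keys that `other` has but `this` lacks, sorted for a stable result.
  return sorted(other_keys - this_keys)


records_1 = [{'text_id': '1', 'model_1_predictions': ['positive']}]
records_2 = [{'text_id': '1', 'model_1_predictions': ['positive'],
              'model_2_predictions': ['negative']}]
print(key_difference_sketch(records_1, records_2))  # ['model_2_predictions']
print(key_difference_sketch(records_2, records_1))  # []
```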

In [17]:
dataset_2.combine_data_on_id(dataset_1, 'target_id', ['model_1_predictions'])

OverwriteError: ignored

This is because by default it will raise an error if you try to overwrite any data. This behaviour can be changed like so:

In [18]:
dataset_2.combine_data_on_id(dataset_1, 'target_id', ['model_1_predictions'], 
                             raise_on_overwrite=False)
next(dataset_2.dict_iterator())

{'categories': None,
 'category_sentiments': None,
 'model_1_predictions': ['positive', 'positive', 'negative'],
 'model_2_predictions': ['negative', 'negative', 'positive'],
 'spans': [Span(start=22, end=28),
  Span(start=49, end=54),
  Span(start=0, end=13)],
 'target_id': ['81207500663427072::0',
  '81207500663427072::1',
  '81207500663427072::2'],
 'target_sentiments': ['neutral', 'positive', 'neutral'],
 'targets': ['Police', 'crime', 'Conservatives'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072'}

As you can see, there is no error and no change in the data, as we only combined data that already existed.
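
The overwrite guard can be pictured with the small sketch below. The names are hypothetical: `OverwriteSketchError` is a stand-in for the library's `OverwriteError`, and `set_value` only illustrates the idea of a `raise_on_overwrite` flag.

```python
from typing import Any, Dict


class OverwriteSketchError(Exception):
  """Hypothetical stand-in for the library's `OverwriteError`."""


def set_value(record: Dict[str, Any], key: str, value: Any,
              raise_on_overwrite: bool = True) -> None:
  # Refuse to replace existing data unless the caller opts in.
  if raise_on_overwrite and key in record:
    raise OverwriteSketchError(f'`{key}` already exists in the record')
  record[key] = value


record = {'model_1_predictions': ['positive']}
try:
  set_value(record, 'model_1_predictions', ['positive'])
except OverwriteSketchError as error:
  print(error)
set_value(record, 'model_1_predictions', ['positive'],
          raise_on_overwrite=False)
```

Raising by default protects against silently replacing one model's predictions with another's when the same key name is reused.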

### In-Detail Combining data
In this small sub-section we describe how `combine_data_on_id` works when your predictions are not just single predictions but multiple predictions per target, for example because you have run the same model multiple times to take into account the [random seed problem](https://www.aclweb.org/anthology/D17-1035.pdf).

In the setup below we have a similar setup as before: `dataset_3` contains `model_1` predictions and `dataset_4` contains `model_2` predictions, and `dataset_3` is re-ordered. `model_1` predicts all positive unless the target is `Conservatives`, and vice versa for `model_2`. However, unlike before, we now run each model 8 times, where each run would use a different random seed.

In [19]:
dataset_3 = wang_2017_election_twitter_train()
dataset_4 = wang_2017_election_twitter_train()
# Add the unique target ids
for dataset in [dataset_3, dataset_4]:
  dataset.add_unique_key('targets', 'target_id')
# re-order dataset_3
dataset_3.re_order()
# add predictions to datasets
add_fictional_predictions(dataset_3, 'model_1', True, number_runs=8)
add_fictional_predictions(dataset_4, 'model_2', False, number_runs=8)

next(dataset_3.dict_iterator())

{'categories': None,
 'category_sentiments': None,
 'model_1_predictions': [['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive']],
 'spans': [Span(start=0, end=13),
  Span(start=22, end=28),
  Span(start=49, end=54)],
 'target_id': ['81207500663427072::2',
  '81207500663427072::0',
  '81207500663427072::1'],
 'target_sentiments': ['neutral', 'neutral', 'positive'],
 'targets': ['Conservatives', 'Police', 'crime'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072'}

A sample from `dataset_3` is above and one from `dataset_4` below. From these samples we can see that the predictions are now a `List[List]` (a list of lists), where each inner list relates to the `targets` and the outer list relates to the number of model runs (8).

In [20]:
next(dataset_4.dict_iterator())

{'categories': None,
 'category_sentiments': None,
 'model_2_predictions': [['negative', 'negative', 'positive'],
  ['negative', 'negative', 'positive'],
  ['negative', 'negative', 'positive'],
  ['negative', 'negative', 'positive'],
  ['negative', 'negative', 'positive'],
  ['negative', 'negative', 'positive'],
  ['negative', 'negative', 'positive'],
  ['negative', 'negative', 'positive']],
 'spans': [Span(start=22, end=28),
  Span(start=49, end=54),
  Span(start=0, end=13)],
 'target_id': ['81207500663427072::0',
  '81207500663427072::1',
  '81207500663427072::2'],
 'target_sentiments': ['neutral', 'positive', 'neutral'],
 'targets': ['Police', 'crime', 'Conservatives'],
 'text': 'Conservatives cut the Police budget and this cut crime! Maybe not... #spin #BattleForNumber10',
 'text_id': '81207500663427072'}

Now that we have these datasets we want the predictions from `model_2` combined into `dataset_3` which can be done like before using `combine_data_on_id`:

In [0]:
dataset_3.combine_data_on_id(dataset_4, 'target_id', data_keys=['model_2_predictions'])

In [22]:
next(dataset_3.dict_iterator())

{'categories': None,
 'category_sentiments': None,
 'model_1_predictions': [['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive'],
  ['negative', 'positive', 'positive']],
 'model_2_predictions': [['positive', 'negative', 'negative'],
  ['positive', 'negative', 'negative'],
  ['positive', 'negative', 'negative'],
  ['positive', 'negative', 'negative'],
  ['positive', 'negative', 'negative'],
  ['positive', 'negative', 'negative'],
  ['positive', 'negative', 'negative'],
  ['positive', 'negative', 'negative']],
 'spans': [Span(start=0, end=13),
  Span(start=22, end=28),
  Span(start=49, end=54)],
 'target_id': ['81207500663427072::2',
  '81207500663427072::0',
  '81207500663427072::1'],
 'target_sentiments': ['neutral', 'neutral', 'positive'],
 'targets': ['Conservati

As we can see above the `model_2_predictions` are now in `dataset_3`.
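
For the multi-run case, the same ID lookup is applied to each inner list while the outer list of runs keeps its order. The sketch below illustrates this under assumptions: `align_runs` is a hypothetical helper, the `t::N` IDs are shortened for readability, and only two runs are shown instead of eight.

```python
from typing import List


def align_runs(source_ids: List[str], target_ids: List[str],
               runs: List[List[str]]) -> List[List[str]]:
  source_index = {id_: index for index, id_ in enumerate(source_ids)}
  order = [source_index[id_] for id_ in target_ids]
  # The outer list (one entry per run) keeps its order; only each inner
  # list (one entry per target) is permuted.
  return [[run[index] for index in order] for run in runs]


runs = [['negative', 'negative', 'positive'],
        ['negative', 'negative', 'positive']]
aligned = align_runs(['t::0', 't::1', 't::2'], ['t::2', 't::0', 't::1'], runs)
print(aligned)
# [['positive', 'negative', 'negative'], ['positive', 'negative', 'negative']]
```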