
In [0]:
%%capture
!pip install git+https://github.com/apmoore1/target-extraction.git@v0.0.2#egg=target_extraction

# SemEval 2015 and 2016 datasets: do the texts contain the same aspect more than once?

The title is the question this notebook answers. We only look at sentences annotated with (target, aspect, sentiment) samples, where the aspect can also be an (entity, aspect) pair. An aspect or (entity, aspect) pair can occur more than once in a text because each occurrence is attached to a different target within that text. Thus there could be the following case:

```
The CPU memory is great and so is the RAM
```

Given this text we could have the annotations (memory, MEMORY, positive) and (RAM, MEMORY, positive), each written as (target, aspect, sentiment). In this case the MEMORY aspect occurs twice in the same text. We want to know how often this happens in the SemEval datasets for two reasons. First, it tells us whether the target sentiment problem could be treated as a text/sentence-level aspect-based sentiment analysis task. Second, it tells us whether aspects have to be identified at the target level or could be identified at the text/sentence level.
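
As a minimal sketch of the check performed later in this notebook, the toy annotations above can be counted with `collections.Counter`. The `annotations` list and the `repeated_aspects` helper are illustrative names only and are not part of `target_extraction`.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hand-written (target, aspect, sentiment) annotations for the toy sentence.
annotations: List[Tuple[str, str, str]] = [
    ('memory', 'MEMORY', 'positive'),
    ('RAM', 'MEMORY', 'positive'),
]


def repeated_aspects(samples: List[Tuple[str, str, str]]) -> Dict[str, int]:
    """Return every aspect that occurs more than once, with its count."""
    aspect_count = Counter(aspect for _, aspect, _ in samples)
    return {aspect: count for aspect, count in aspect_count.items()
            if count > 1}


print(repeated_aspects(annotations))
```

For the toy sentence this prints `{'MEMORY': 2}`: collapsing the text to one label per aspect would merge two separate samples into one.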

To analyse this we first need to load the datasets. The datasets that have (target, aspect, sentiment) annotations are:
1. [Restaurant SemEval 2015 Train.](http://alt.qcri.org/semeval2015/task12/index.php?id=data-and-tools)
2. [Restaurant SemEval 2015 Test.](http://alt.qcri.org/semeval2015/task12/index.php?id=data-and-tools)
3. [Restaurant SemEval 2016 Train.](http://alt.qcri.org/semeval2015/task12/index.php?id=data-and-tools)
4. [Restaurant SemEval 2016 Test.](http://alt.qcri.org/semeval2015/task12/index.php?id=data-and-tools)

There are more SemEval datasets from the 2016 edition that contain this annotation format. However, for now we only look at these four datasets.
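
To give a feel for this annotation format, the sketch below reads (target, aspect, sentiment) samples from a single sentence using only the standard library. The XML snippet is hand-written for illustration and assumes the `Opinion` element layout (`target`, `category` and `polarity` attributes, with the string `NULL` marking a missing target) used by the SemEval 2015 and 2016 files. `read_opinions` is a hypothetical helper and is not how `target_extraction` parses the data.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Hand-written sentence in the style of the SemEval 2015/2016 XML files
# (an assumption for illustration, not copied from the datasets).
EXAMPLE_XML = """
<sentence id="example:0">
  <text>The pizza was great but overpriced.</text>
  <Opinions>
    <Opinion target="pizza" category="FOOD#QUALITY" polarity="positive" from="4" to="9"/>
    <Opinion target="NULL" category="FOOD#PRICES" polarity="negative" from="0" to="0"/>
  </Opinions>
</sentence>
"""


def read_opinions(sentence_xml: str) -> List[Tuple[Optional[str], str, str]]:
    """Return the (target, aspect, sentiment) samples of one sentence."""
    sentence = ET.fromstring(sentence_xml)
    samples = []
    for opinion in sentence.iter('Opinion'):
        target = opinion.get('target')
        # Assumption: the string "NULL" marks a sample without a target.
        if target == 'NULL':
            target = None
        samples.append((target, opinion.get('category'),
                        opinion.get('polarity')))
    return samples


for sample in read_opinions(EXAMPLE_XML):
    print(sample)
```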

In [42]:
from collections import Counter
from pathlib import Path

from google.colab import files
from target_extraction.dataset_parsers import semeval_2016
from target_extraction.data_types import TargetTextCollection

# SemEval 2015 and 2016 Restaurant train and test files
semeval_dataset_fp = {'Restaurant Train 2015': Path('ABSA-15_Restaurants_Train_Final.xml'),
                      'Restaurant Test 2015': Path('ABSA15_Restaurants_Test.xml'),
                      'Restaurant Train 2016': Path('ABSA16_Restaurants_Train_SB1_v2.xml'),
                      'Restaurant Test 2016': Path('EN_REST_SB1_TEST.xml')}
semeval_dataset = {}
for dataset_name, fp in semeval_dataset_fp.items():
  if not fp.exists():
    print(f'Upload {dataset_name}')
    files.upload()
  semeval_dataset[dataset_name] = semeval_2016(fp, conflict=True)


Upload Restaurant Train 2016


Saving ABSA16_Restaurants_Train_SB1_v2.xml to ABSA16_Restaurants_Train_SB1_v2.xml
Upload Restaurant Test 2016


Saving EN_REST_SB1_TEST.xml to EN_REST_SB1_TEST.xml


In [0]:
# SemEval Restaurant
restaurant_train = semeval_dataset['Restaurant Train 2015']
restaurant_test = semeval_dataset['Restaurant Test 2015']
restaurant_combined_2015 = TargetTextCollection.combine(restaurant_train, 
                                                        restaurant_test)
restaurant_combined_2015.name = 'Restaurant 2015'

restaurant_train = semeval_dataset['Restaurant Train 2016']
restaurant_test = semeval_dataset['Restaurant Test 2016']
restaurant_combined_2016 = TargetTextCollection.combine(restaurant_train, 
                                                        restaurant_test)
restaurant_combined_2016.name = 'Restaurant 2016'

To confirm that we have loaded the data correctly, we check that the total number of aspects/categories in each dataset matches the number reported in the papers:
1. For Restaurant 2015 we should have 2499 according to [Pontiki et al. 2015](https://www.aclweb.org/anthology/S15-2082.pdf)
2. For Restaurant 2016 we should have 3366 according to [Pontiki et al. 2016](https://www.aclweb.org/anthology/S16-1002.pdf)

In [44]:
print(f'Number of aspects in Restaurant 2015 '
      f'{restaurant_combined_2015.number_categories()}')
print(f'Number of aspects in Restaurant 2016 '
      f'{restaurant_combined_2016.number_categories()}')

Number of aspects in Restaurant 2015 2499
Number of aspects in Restaurant 2016 3366


It would appear from the above that we have parsed the datasets correctly, as the number of aspects matches the number reported in each original paper.

We now move on to see how many texts and how many samples are affected if we treat the task as a text-level aspect task instead of taking the target into account:

In [48]:
def aspects_affected(dataset: TargetTextCollection) -> None:
  '''Prints how many texts and samples contain a repeated aspect.'''
  number_texts_wrong = 0
  number_wrong = 0
  for key, value in dataset.items():
    aspects_in_text = value['categories']
    # A text without aspects should not have any targets either.
    if aspects_in_text is None:
      assert value['targets'] is None
      continue
    # Every occurrence of an aspect beyond its first in a text is a sample
    # that would be merged away at the text level.
    number_aspects = len(aspects_in_text)
    aspect_count = Counter(aspects_in_text)
    aspect_count_diff = number_aspects - len(aspect_count)
    if aspect_count_diff != 0:
      number_texts_wrong += 1
      number_wrong += aspect_count_diff
  number_samples = dataset.number_categories()
  number_texts = len(dataset)
  percent_wrong = round((number_wrong / float(number_samples)) * 100, 2)
  # Texts are reported as a percentage of all texts, not of all samples.
  percent_text_wrong = round((number_texts_wrong / float(number_texts)) * 100, 2)
  print(f'For the dataset {dataset.name} which contains {number_samples} '
        f'samples and {number_texts} texts\n{number_texts_wrong}'
        f'({percent_text_wrong}%) texts and {number_wrong}'
        f'({percent_wrong}%) samples are affected.')
  
aspects_affected(restaurant_combined_2015)
aspects_affected(restaurant_combined_2016)

For the dataset Restaurant 2015 which contains 2499 samples and 2000 texts
188(9.4%) texts and 246(9.84%) samples are affected.
For the dataset Restaurant 2016 which contains 3366 samples and 2676 texts
255(9.53%) texts and 365(10.84%) samples are affected.


As we can see from the two datasets, up to 11% of samples and up to 10% of texts are affected.

# How many implicit targets are there?

In these datasets some of the targets are implicit: the target value is `None` while the aspect exists. An example of this can be seen below:

In [51]:
for key, value in restaurant_combined_2015['P#3:3'].items():
  print(f'{key}: {value}')

text: Cool atmosphere, the fire place in the back really ads to it but needs a bit more heat throughout on a cold night.
text_id: P#3:3
targets: ['atmosphere', 'fire place', None]
spans: [Span(start=5, end=15), Span(start=21, end=31), Span(start=0, end=0)]
target_sentiments: ['positive', 'positive', 'negative']
categories: ['AMBIENCE#GENERAL', 'AMBIENCE#GENERAL', 'AMBIENCE#GENERAL']
category_sentiments: None
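
The record above assigns the same aspect to all three samples, but not the same sentiment. As a small sketch of what a text-level view would have to reconcile, the snippet below groups the sentiments by aspect. The `record` dictionary copies the relevant fields printed above, and `sentiments_per_aspect` is a hypothetical helper rather than part of `target_extraction`.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set

# Fields copied from the 'P#3:3' record printed above.
record: Dict[str, List[Optional[str]]] = {
    'targets': ['atmosphere', 'fire place', None],
    'target_sentiments': ['positive', 'positive', 'negative'],
    'categories': ['AMBIENCE#GENERAL', 'AMBIENCE#GENERAL',
                   'AMBIENCE#GENERAL'],
}


def sentiments_per_aspect(sample: Dict[str, List[Optional[str]]]
                          ) -> Dict[str, Set[str]]:
    """Map each aspect in a text to the set of sentiments attached to it."""
    grouped = defaultdict(set)
    for aspect, sentiment in zip(sample['categories'],
                                 sample['target_sentiments']):
        grouped[aspect].add(sentiment)
    return dict(grouped)


for aspect, sentiments in sentiments_per_aspect(record).items():
    label = 'conflicting' if len(sentiments) > 1 else 'consistent'
    print(f'{aspect}: {sorted(sentiments)} ({label})')
```

This prints `AMBIENCE#GENERAL: ['negative', 'positive'] (conflicting)`, so a single text-level label for this aspect could not represent all three samples.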


This particular example shows how difficult it would be to identify implicit targets, as the same aspect already appears explicitly in the text. However, the question we want to answer here is how many implicit targets there are:

In [61]:
def implicit_targets(dataset: TargetTextCollection) -> None:
  '''Prints how many texts and samples have an implicit (None) target.'''
  number_implicit_texts = 0
  number_implicit_targets = 0
  for key, value in dataset.items():
    targets = value['targets']
    aspects = value['categories']
    # Skip texts that have no annotations at all.
    if targets is None:
      continue
    text_is_implicit = False
    for target, aspect in zip(targets, aspects):
      # An implicit target has no target string but still has an aspect.
      if target is None:
        assert aspect is not None
        number_implicit_targets += 1
        text_is_implicit = True
    if text_is_implicit:
      number_implicit_texts += 1
  number_samples = dataset.number_categories()
  number_texts = len(dataset)
  # Texts are reported as a percentage of all texts, not of all samples.
  percent_texts = round((number_implicit_texts / float(number_texts)) * 100, 2)
  percent_targets = round((number_implicit_targets / float(number_samples)) * 100, 2)
  print(f'For the dataset {dataset.name} which contains {number_samples} '
        f'samples and {number_texts} texts\n{number_implicit_texts}'
        f'({percent_texts}%) texts and {number_implicit_targets}'
        f'({percent_targets}%) samples are implicit.')
  
implicit_targets(restaurant_combined_2015)
implicit_targets(restaurant_combined_2016)

For the dataset Restaurant 2015 which contains 2499 samples and 2000 texts
566(28.3%) texts and 621(24.85%) samples are implicit.
For the dataset Restaurant 2016 which contains 3366 samples and 2676 texts
766(28.62%) texts and 832(24.72%) samples are implicit.


We can see that just under 25% of the samples have an implicit target, and more than 28% of the texts contain at least one implicit target!