
In [0]:
%%capture
!pip install git+https://github.com/apmoore1/target-extraction.git@master#egg=target-extraction

# Difference between MAMS ATSA original and MAMS ATSA cleaned

In this notebook we describe the subtle differences between the original MAMS ATSA dataset from [Jiang et al. 2019](https://www.aclweb.org/anthology/D19-1654.pdf) and the cleaned version, which was created while exploring the dataset within the Bella package. The differences apply only to the training split; the validation and test splits are unchanged from the original.

Below we load both original and cleaned training sets:

In [2]:
from target_extraction.dataset_parsers import multi_aspect_multi_sentiment_atsa
original_train = multi_aspect_multi_sentiment_atsa('train')
original_train.name = 'MAMS Original (Train)'
cleaned_train = multi_aspect_multi_sentiment_atsa('train', original=False)
cleaned_train.name = 'MAMS Cleaned (Train)'

1641623B [00:00, 26758898.30B/s]         
1641213B [00:00, 27689937.90B/s]         


After loading the datasets we report the dataset statistics below:

In [4]:
from target_extraction.analysis.dataset_statistics import dataset_target_sentiment_statistics
dataset_target_sentiment_statistics([original_train, cleaned_train],
                                    dataframe_format=True)

| | Name | No. Sentences | No. Sentences(t) | No. Targets | No. Uniq Targets | ATS | ATS(t) | TL 1 % | TL 2 % | TL 3+ % | Mean Sentence Length | Mean Sentence Length(t) | POS (%) | NEU (%) | NEG (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MAMS Original (Train) | 4297 | 4297 | 11186 | 2410 | 2.6 | 2.6 | 82.15 | 11.56 | 6.29 | 26.27 | 26.27 | 3380 (30.22) | 5042 (45.07) | 2764 (24.71) |
| 1 | MAMS Cleaned (Train) | 4297 | 4297 | 11180 | 2406 | 2.6 | 2.6 | 82.17 | 11.53 | 6.3 | 26.27 | 26.27 | 3379 (30.22) | 5038 (45.06) | 2763 (24.71) |


As can be seen, the only difference is that the cleaned version has 6 fewer samples/targets (11,180 compared to 11,186); the number of sentences is unchanged.
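The size of the difference can be checked directly from the counts in the statistics table. The snippet below is only a quick arithmetic sketch: the numbers are copied by hand from the table, and the dictionary keys are illustrative names, not part of the `target_extraction` API.

```python
# Counts copied by hand from the statistics table above
# (the dictionary keys are illustrative names, not library fields).
original = {"targets": 11186, "unique_targets": 2410,
            "POS": 3380, "NEU": 5042, "NEG": 2764}
cleaned = {"targets": 11180, "unique_targets": 2406,
           "POS": 3379, "NEU": 5038, "NEG": 2763}

# Per-column difference: original minus cleaned.
differences = {key: original[key] - cleaned[key] for key in original}
print(differences)
# → {'targets': 6, 'unique_targets': 4, 'POS': 1, 'NEU': 4, 'NEG': 1}
```

The per-sentiment differences (1 + 4 + 1) add up to the 6 removed targets.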

These differences are due to overlapping targets in the original dataset, which can be seen below:

In [6]:
from target_extraction.tokenizers import spacy_tokenizer
original_train.tokenize(spacy_tokenizer())
sequence_errors = original_train.sequence_labels(return_errors=True)

for error in sequence_errors:
  _id = error['text_id']
  text = error['text']
  targets = error['targets']
  spans = error['spans']
  print(f'ID of error {_id}')
  print(f'targets {targets}')
  print(f'target spans {spans}')
  print(f'text {text}\n')

ID of error train$1191
targets ['beer selection', 'beer s']
target spans [Span(start=12, end=26), Span(start=12, end=18)]
text I liked the beer selection!

ID of error train$1203
targets ['table space', 'water', 'patrons waiting', 'table s']
target spans [Span(start=3, end=14), Span(start=113, end=118), Span(start=147, end=162), Span(start=3, end=10)]
text No table space and one of the angry neighbors decided to take matters into his own hands by throwing a bucket of water out his window and onto the patrons waiting for their tables.

ID of error train$2385
targets ['appetizer', 'fritas', 'spicy shrimp with coconut rice', 'coconut', 'dessert tres leches de mango with calle']
target spans [Span(start=4, end=13), Span(start=21, end=27), Span(start=47, end=77), Span(start=65, end=72), Span(start=131, end=170)]
text The appetizer, conch fritas was yummy; entree, spicy shrimp with coconut rice (just a hint of coconut - not overwhelming); and the dessert tres leches de mango with calle ocho 

As can be seen above, each of these `TargetText` samples contains two targets whose `Span`s overlap in the text. For example, in `train$1191` the targets are `['beer selection', 'beer s']`: it does not make sense for two targets to cover the same text, and in this case `beer s` appears to be an annotation mistake.
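To make the notion of overlap concrete, here is a minimal sketch of how overlapping spans could be detected. It is not how `sequence_labels` finds these errors internally. The `Span` below is a hypothetical stand-in defined locally as a `namedtuple` with an exclusive `end` offset, and `overlapping_pairs` is an illustrative helper, not a library function.

```python
from collections import namedtuple
from itertools import combinations

# Hypothetical stand-in for the Span type printed above:
# character offsets into the text, with an exclusive end.
Span = namedtuple("Span", ["start", "end"])


def overlapping_pairs(targets, spans):
    """Return the pairs of targets whose character spans overlap."""
    pairs = []
    for (t1, s1), (t2, s2) in combinations(zip(targets, spans), 2):
        # Two half-open intervals overlap when each one starts
        # before the other one ends.
        if s1.start < s2.end and s2.start < s1.end:
            pairs.append((t1, t2))
    return pairs


# Targets and spans copied from the output for ID train$1191.
targets = ['beer selection', 'beer s']
spans = [Span(start=12, end=26), Span(start=12, end=18)]
print(overlapping_pairs(targets, spans))
# → [('beer selection', 'beer s')]
```

Applied to the targets and spans of `train$2385`, the same check flags only `spicy shrimp with coconut rice` and `coconut`, which matches the `coconut` target removed in the cleaned version.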

The cleaned version removes the following targets that are believed to be annotation mistakes:
1. ID `train$1191` removed `beer s`
2. ID `train$1203` removed `table s`
3. ID `train$2385` removed `coconut`
4. ID `train$2645` removed `pizza`
5. ID `train$3865` removed `clam s`
6. ID `train$3903` removed `beet s`