In [1]:
from pathlib import Path
import random as rn
rn.seed(42)

import numpy as np
np.random.seed(42)
from sklearn.model_selection import StratifiedShuffleSplit
from bella.parsers import semeval_14
from bella.data_types import TargetCollection, Target
from bella import write_data

import config

Using TensorFlow backend.


# Creating Training and Test sets for the YouTuBean Dataset
We show how we created the Training and Test sets for this dataset.

The original dataset can be downloaded from [here](https://raw.githubusercontent.com/epochx/opinatt/master/samsung_galaxy_s5.xml) and the accompanying paper can be found [here](https://www.aclweb.org/anthology/W17-5213). Because Marrese-Taylor et al. evaluated their models with 5-fold cross-validation, the dataset has no single predefined train/test split. We therefore create a 70% train and 30% test split and save both sets in the same XML format as the original, which is the [SemEval 2014](http://alt.qcri.org/semeval2014/task4/) XML format.

In [2]:
# YouTube dataset
youtubean_with_conflicts = semeval_14(config.youtubean_original, conflict=True)
youtubean = semeval_14(config.youtubean_original, conflict=False)
print(f'''
Shown above are two ways to parse the data. The first keeps the labels called conflicts, which mark targets whose sentiment is both positive and negative. The second removes the conflict labels. The original analysis by Marrese-Taylor et al. mapped the conflict labels to neutral labels. However, in all of our experiments we remove the conflict-labelled data.

We show both here so that you can see how to parse the data differently. We also parse the data with conflicts to show that we have parsed it correctly, as the statistics in the Marrese-Taylor et al. paper assume the conflict data is included. In their paper they state that there are 525 unique aspect terms, which is exactly what we get, as shown here:

{youtubean_with_conflicts.number_unique_targets()}

In the dataset that we are using which contains **no conflict labels** there are: {len(youtubean)} targets
''')


Shown above are two ways to parse the data. The first keeps the labels called conflicts, which mark targets whose sentiment is both positive and negative. The second removes the conflict labels. The original analysis by Marrese-Taylor et al. mapped the conflict labels to neutral labels. However, in all of our experiments we remove the conflict-labelled data.

We show both here so that you can see how to parse the data differently. We also parse the data with conflicts to show that we have parsed it correctly, as the statistics in the Marrese-Taylor et al. paper assume the conflict data is included. In their paper they state that there are 525 unique aspect terms, which is exactly what we get, as shown here:

525

In the dataset that we are using which contains **no conflict labels** there are: 798 targets
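
To make the effect of the `conflict` flag concrete, here is a minimal sketch that reads SemEval 2014 style XML with the standard library only. It is not bella's `semeval_14` parser: the helper `read_targets`, the dictionary keys it returns, and the two sample sentences are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Toy data in the SemEval 2014 aspect-term layout (invented, not from the dataset)
SAMPLE_XML = """<sentences>
  <sentence id="1">
    <text>The camera is great but the battery is hit and miss.</text>
    <aspectTerms>
      <aspectTerm term="camera" polarity="positive" from="4" to="10"/>
      <aspectTerm term="battery" polarity="conflict" from="28" to="35"/>
    </aspectTerms>
  </sentence>
  <sentence id="2">
    <text>The screen is fine.</text>
    <aspectTerms>
      <aspectTerm term="screen" polarity="neutral" from="4" to="10"/>
    </aspectTerms>
  </sentence>
</sentences>"""


def read_targets(xml_text, conflict=True):
    """Hypothetical helper: return one dict per aspect term.

    When conflict is False, aspect terms labelled 'conflict' are skipped.
    """
    root = ET.fromstring(xml_text)
    targets = []
    for sentence in root.iter('sentence'):
        text = sentence.findtext('text')
        for term in sentence.iter('aspectTerm'):
            polarity = term.get('polarity')
            if polarity == 'conflict' and not conflict:
                continue
            targets.append({
                'text': text,
                'target': term.get('term'),
                'polarity': polarity,
                'span': (int(term.get('from')), int(term.get('to'))),
            })
    return targets


with_conflicts = read_targets(SAMPLE_XML, conflict=True)
without_conflicts = read_targets(SAMPLE_XML, conflict=False)
print(len(with_conflicts), len(without_conflicts))
print(len({t['target'] for t in with_conflicts}), 'unique targets')
```

The character offsets in `from` and `to` index into the sentence text, so `text[start:end]` should give back the target string.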



In [3]:
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

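# Stratify on the sentiment labels so both splits keep the class ratios
youtubean_data = np.asarray(youtubean.data_dict())
youtubean_sentiment = np.asarray(youtubean.sentiment_data())
# n_splits=1, so the loop body runs exactly once
for train_indexs, test_indexs in splitter.split(youtubean_data, youtubean_sentiment):
    train_data = youtubean_data[train_indexs]
    test_data = youtubean_data[test_indexs]

# Rebuild Target objects from the dictionaries selected for each split
convert_to_targets = lambda data: [Target(**target) for target in data]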
youtubean_train = TargetCollection(convert_to_targets(train_data))
youtubean_test = TargetCollection(convert_to_targets(test_data))
print(f'''
The dataset has now been split with stratification on the class labels, so each class label appears in the same proportion in the train and test splits, as shown here:

Train Data ratio: {youtubean_train.ratio_targets_sentiment()}
Train Data raw values: {youtubean_train.no_targets_sentiment()}

Test Data ratio: {youtubean_test.ratio_targets_sentiment()}
Test Data raw values: {youtubean_test.no_targets_sentiment()}

Original Data ratio: {youtubean.ratio_targets_sentiment()}
Original Data raw values: {youtubean.no_targets_sentiment()}

We now save the data to XML file format which is the same as the SemEval data format.
''')


The dataset has now been split with stratification on the class labels, so each class label appears in the same proportion in the train and test splits, as shown here:

Train Data ratio: {1: 0.28, 0: 0.63, -1: 0.09}
Train Data raw values: {1: 157, 0: 352, -1: 49}

Test Data ratio: {0: 0.63, -1: 0.09, 1: 0.28}
Test Data raw values: {0: 152, -1: 21, 1: 67}

Original Data ratio: {0: 0.63, 1: 0.28, -1: 0.09}
Original Data raw values: {0: 504, 1: 224, -1: 70}

We now save the data to XML file format which is the same as the SemEval data format.
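
The idea behind a stratified split can be sketched with the standard library alone. This is not scikit-learn's `StratifiedShuffleSplit` algorithm: the helper `stratified_split` is hypothetical and rounds the test size per class, so its counts can differ by a sample from the ones printed above. Only the per-class totals (504, 224, 70) are taken from the notebook.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Hypothetical helper: split indices so each class keeps roughly
    the same share in the train and test sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def ratios(counts):
    """Share of each label, rounded to two decimals."""
    total = sum(counts.values())
    return {label: round(count / total, 2) for label, count in counts.items()}


# Per-class totals of the original data as printed above
labels = [0] * 504 + [1] * 224 + [-1] * 70
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print('Train ratio:', ratios(train_counts))
print('Test ratio:', ratios(test_counts))
```

Because the fraction is applied per class rather than to the dataset as a whole, rare classes such as `-1` cannot end up missing from the test set by chance.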



In [4]:
write_data.semeval_14(config.youtubean_train, youtubean_train)
write_data.semeval_14(config.youtubean_test, youtubean_test)
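
For readers unfamiliar with the SemEval 2014 layout, the following is a rough sketch of how targets could be serialised to that XML shape using only the standard library. It is not the implementation of `write_data.semeval_14`: the helper `to_semeval_xml`, the dictionary keys, and the sample targets are assumptions made for illustration, and polarity is written as a string rather than the numeric labels used above.

```python
import xml.etree.ElementTree as ET


def to_semeval_xml(targets):
    """Hypothetical helper: build SemEval 2014 style XML from target dicts.

    Targets that share the same sentence text are grouped under one
    <sentence> element.
    """
    grouped = {}
    for target in targets:
        grouped.setdefault(target['text'], []).append(target)

    root = ET.Element('sentences')
    for sentence_id, (text, sentence_targets) in enumerate(grouped.items(), start=1):
        sentence = ET.SubElement(root, 'sentence', {'id': str(sentence_id)})
        ET.SubElement(sentence, 'text').text = text
        terms = ET.SubElement(sentence, 'aspectTerms')
        for target in sentence_targets:
            ET.SubElement(terms, 'aspectTerm', {
                'term': target['target'],
                'polarity': target['polarity'],
                'from': str(target['from']),
                'to': str(target['to']),
            })
    return ET.tostring(root, encoding='unicode')


# Invented sample targets, not taken from the dataset
sample_targets = [
    {'text': 'The camera is great but the screen is dim.',
     'target': 'camera', 'polarity': 'positive', 'from': 4, 'to': 10},
    {'text': 'The camera is great but the screen is dim.',
     'target': 'screen', 'polarity': 'negative', 'from': 28, 'to': 34},
]
xml_text = to_semeval_xml(sample_targets)
print(xml_text)
```

Parsing `xml_text` back with `ET.fromstring` is a quick way to check that the written file is well-formed before using it in experiments.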