In [1]:
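# Silence anything raised through warnings.warn by replacing it with a no-op.
# This must run before the imports below so that they pick up the patched function.
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn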
import os
import sys
import random as rn
rn.seed(42)

sys.path.append(os.path.abspath(os.pardir))

import numpy as np
np.random.seed(42)
from sklearn.model_selection import StratifiedShuffleSplit

from tdparse.helper import read_config, full_path
from tdparse.parsers import semeval_14, hu_liu
from tdparse.data_types import TargetCollection, Target
from tdparse import write_data

In [2]:
youtubean = semeval_14(full_path(read_config('youtubean')))
# Product reviews are made up of three different products: 1. Computer, 2. Router, and 3. Speaker
product_reviews_folder = full_path(read_config('product_reviews_dir'))
speaker_reviews = semeval_14(os.path.join(product_reviews_folder, 'Speaker.xml'))
computer_reviews = semeval_14(os.path.join(product_reviews_folder, 'Computer.xml'))
router_reviews = semeval_14(os.path.join(product_reviews_folder, 'Router.xml'))

# Creating Training and Test sets for the YouTuBean and Product reviews datasets
We show how we created the Training and Test sets for these two datasets.

The original YouTuBean dataset can be found [here](https://github.com/epochx/opinatt), and is associated with this [paper](https://www.aclweb.org/anthology/W17-5213). It contains {{len(youtubean)}} target sentiment labels and has {{len(youtubean.stored_sentiments())}} unique sentiments. We split the dataset into a 70% training set and a 30% test set.

The product review dataset of [Liu et al.](https://www.ijcai.org/Proceedings/15/Papers/186.pdf) is made up of reviews of three different Amazon products (number of target sentiments in brackets): 1. Speaker reviews ({{len(speaker_reviews)}}), 2. Computer reviews ({{len(computer_reviews)}}), and 3. Wireless Router reviews ({{len(router_reviews)}}). The original dataset can be found [here](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets). We combine these reviews into one dataset of product reviews, as they are all the same text type. We then split this dataset into 70% train and 30% test such that each product dataset is represented in the same proportion in the train and test splits. This removes the need for domain adaptation, as each domain is represented in both splits.
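To make the idea of a label-stratified 70/30 split concrete, here is a minimal standard-library sketch. It is not scikit-learn's `StratifiedShuffleSplit`, which the notebook actually uses and which may allocate samples slightly differently; the function name `stratified_split` and the toy labels are assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_indices, test_indices), splitting every label at roughly test_size."""
    rng = random.Random(seed)
    # Group the sample indices by their class label
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    train, test = [], []
    for indices in by_label.values():
        # Shuffle within the class, then send the first test_size share to the test set
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical toy labels, not taken from either dataset
labels = ['positive'] * 10 + ['negative'] * 20 + ['neutral'] * 10
train_idx, test_idx = stratified_split(labels)
```

Because the shuffle and the cut happen inside each class, the class proportions of the full dataset carry over to both splits, up to rounding.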

## YouTuBean train and test set creation

In [3]:
# One stratified 70/30 split; with no random_state given, the shuffle draws on
# the global NumPy random state seeded in the first cell
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3)
youtubean_data = np.asarray(youtubean.data_dict())
youtubean_targets = np.asarray(youtubean.sentiment_data())
# n_splits=1, so the loop body runs exactly once
for train_indexs, test_indexs in splitter.split(youtubean_data, youtubean_targets):
    youtubean_data_train = youtubean_data[train_indexs]
    youtubean_data_test = youtubean_data[test_indexs]

def convert_to_targets(data):
    """Wrap each target dictionary in a Target object."""
    return [Target(**target) for target in data]

youtubean_train = TargetCollection(convert_to_targets(youtubean_data_train))
youtubean_test = TargetCollection(convert_to_targets(youtubean_data_test))

The dataset has now been split with respect to the class labels, so each class label appears in the same proportion in the train and test splits, as shown here:

Train Data ratio: **{{youtubean_train.ratio_targets_sentiment()}}**
Train Data raw values: **{{youtubean_train.no_targets_sentiment()}}**

Test Data ratio: **{{youtubean_test.ratio_targets_sentiment()}}**
Test Data raw values: **{{youtubean_test.no_targets_sentiment()}}**

Original Data ratio: **{{youtubean.ratio_targets_sentiment()}}**  
Original Data raw values: **{{youtubean.no_targets_sentiment()}}**
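As a rough illustration of what such a check involves, the sketch below counts labels and turns the counts into ratios using only the standard library. The helper `label_counts_and_ratios` is hypothetical and is not the implementation of `no_targets_sentiment` or `ratio_targets_sentiment`; the label lists are made-up toy data.

```python
from collections import Counter


def label_counts_and_ratios(labels):
    """Return (raw counts per label, share of each label) for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    ratios = {label: count / total for label, count in counts.items()}
    return dict(counts), ratios


# Hypothetical splits of a 40-sample dataset, not real results
train_labels = ['positive'] * 7 + ['negative'] * 14 + ['neutral'] * 7
test_labels = ['positive'] * 3 + ['negative'] * 6 + ['neutral'] * 3

train_counts, train_ratios = label_counts_and_ratios(train_labels)
test_counts, test_ratios = label_counts_and_ratios(test_labels)
```

In this toy case the raw counts differ between the splits while the ratios match, which is the property a stratified split is meant to provide.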

We now save the data back to its original XML format, which is the same as the SemEval data format, so that others can use this data without having to use this code base.
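For readers unfamiliar with that format, the sketch below builds a document in the SemEval 2014 Task 4 aspect-term layout (`sentences` / `sentence` / `text` / `aspectTerms` / `aspectTerm`) with the standard library. It is an assumption that `write_data.semeval_14` emits exactly this layout; the function `to_semeval_14_xml`, the input dictionary keys, and the example sentence are all hypothetical.

```python
import xml.etree.ElementTree as ET


def to_semeval_14_xml(sentences):
    """Serialise sentence dictionaries into a SemEval 2014 style XML string."""
    root = ET.Element('sentences')
    for sentence in sentences:
        sentence_el = ET.SubElement(root, 'sentence', id=sentence['id'])
        ET.SubElement(sentence_el, 'text').text = sentence['text']
        terms_el = ET.SubElement(sentence_el, 'aspectTerms')
        for term in sentence['targets']:
            # 'from' and 'to' are character offsets of the target within the text
            ET.SubElement(terms_el, 'aspectTerm', {
                'term': term['target'],
                'polarity': term['sentiment'],
                'from': str(term['start']),
                'to': str(term['end']),
            })
    return ET.tostring(root, encoding='unicode')


# Hypothetical example sentence, not taken from either dataset
example = [{
    'id': '1',
    'text': 'The battery life is great',
    'targets': [{'target': 'battery life', 'sentiment': 'positive',
                 'start': 4, 'end': 16}],
}]
xml_text = to_semeval_14_xml(example)
```

Storing the offsets alongside the term lets a reader recover the exact span even when the same target string occurs more than once in a sentence.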

In [4]:
write_data.semeval_14(full_path(read_config('youtubean_train')), youtubean_train)
write_data.semeval_14(full_path(read_config('youtubean_test')), youtubean_test)

## Product review train and test set creation

In [5]:
product_datasets = {'speaker': speaker_reviews, 'computer': computer_reviews,
                    'router': router_reviews}
# Split each product separately so that every domain contributes 70% of its
# data to train and 30% to test, stratified by sentiment label
train_test_datasets = {}
for product_name, product_dataset in product_datasets.items():
    product_data = np.asarray(product_dataset.data_dict())
    targets = np.asarray(product_dataset.sentiment_data())
    # The splitter has n_splits=1, so the inner loop body runs exactly once
    for train_indexs, test_indexs in splitter.split(product_data, targets):
        train_data = product_data[train_indexs]
        test_data = product_data[test_indexs]
    train_data = TargetCollection(convert_to_targets(train_data))
    test_data = TargetCollection(convert_to_targets(test_data))
    train_test_datasets[product_name] = (train_data, test_data)

Speaker train data raw values: **{{train_test_datasets['speaker'][0].no_targets_sentiment()}}**
Speaker test data raw values: **{{train_test_datasets['speaker'][1].no_targets_sentiment()}}**

Computer train data raw values: **{{train_test_datasets['computer'][0].no_targets_sentiment()}}**
Computer test data raw values: **{{train_test_datasets['computer'][1].no_targets_sentiment()}}**

Wireless Router train data raw values: **{{train_test_datasets['router'][0].no_targets_sentiment()}}**
Wireless Router test data raw values: **{{train_test_datasets['router'][1].no_targets_sentiment()}}**

We now combine the per-product train sets into one train dataset and the per-product test sets into one test dataset, so that each domain and each sentiment value is represented in the same proportion in both.
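The effect of combining per-domain splits can be checked with a small standard-library sketch. The domain sizes below are invented for illustration, and `combine` and `domain_shares` are hypothetical helpers rather than part of `TargetCollection`.

```python
from collections import Counter

# Hypothetical per-domain (train, test) splits; each item is tagged with its domain
domain_splits = {
    'speaker': (['speaker'] * 70, ['speaker'] * 30),
    'computer': (['computer'] * 35, ['computer'] * 15),
    'router': (['router'] * 28, ['router'] * 12),
}


def combine(splits, position):
    """Concatenate the train (position 0) or test (position 1) part of every domain."""
    combined = []
    for split in splits.values():
        combined.extend(split[position])
    return combined


def domain_shares(items):
    """Return the fraction of items that belongs to each domain."""
    counts = Counter(items)
    return {domain: count / len(items) for domain, count in counts.items()}


combined_train = combine(domain_splits, 0)
combined_test = combine(domain_splits, 1)
train_shares = domain_shares(combined_train)
test_shares = domain_shares(combined_test)
```

Since every domain is split 70/30 before combining, each domain ends up with the same share of the combined train set as of the combined test set.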

In [6]:
train_datasets = [train_test[0] for train_test in train_test_datasets.values()]
test_datasets = [train_test[1] for train_test in train_test_datasets.values()]
product_train = TargetCollection.combine_collections(*train_datasets)
product_test = TargetCollection.combine_collections(*test_datasets)

Product train data raw values: **{{product_train.no_targets_sentiment()}}** ratio **{{product_train.ratio_targets_sentiment()}}**

Product test data raw values: **{{product_test.no_targets_sentiment()}}** ratio **{{product_test.ratio_targets_sentiment()}}**

We now save the data back to its original XML format.

In [7]:
write_data.semeval_14(full_path(read_config('product_train')), product_train)
write_data.semeval_14(full_path(read_config('product_test')), product_test)