<a href="https://colab.research.google.com/github/apmoore1/tdsa_comparisons/blob/master/analysis/Baseline_non_target_results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
%%capture
!pip install -U git+https://github.com/apmoore1/target-extraction.git@master#egg=target-extraction

In [0]:
from collections import defaultdict, Counter
from tempfile import TemporaryDirectory
from pathlib import Path
from typing import List, Dict

import requests
import pandas as pd
import numpy as np
from target_extraction.data_types import TargetTextCollection
from target_extraction.analysis.util import metric_df, long_format_metrics
from target_extraction.analysis.util import add_metadata_to_df, combine_metrics
from target_extraction.analysis.sentiment_metrics import accuracy, macro_f1
from target_extraction.analysis.statistical_analysis import one_tailed_p_value

def get_text_sentiment_distribution(dataset: TargetTextCollection
                                    ) -> Dict[str, float]:
    '''
    :param dataset: Dataset.
    :returns: A dictionary of the proportion of positive, negative, and neutral 
              samples in the given TargetTextCollection.
    '''
    text_sentiment_distribution = Counter()
    data_size = len(dataset)
    for target_text in dataset.values():
        text_sentiment_distribution.update([target_text['text_sentiment']])
    for key, value in text_sentiment_distribution.items():
        text_sentiment_distribution[key] = round((value / data_size), 2) * 100
    return dict(text_sentiment_distribution)

def get_dataset_stats(collection: TargetTextCollection, version: str
                      ) -> pd.DataFrame:
  '''
  :param collection: The TargetTextCollection to generate statistics for. 
  :param version: Either `single` or `average`.
  :returns: A DataFrame containing the following columns and one value for each 
            column: `Size` - number of samples in the collection, `Version` - 
            the version argument, `Name` the collection's dataset name, 
            `positive` - the proportion of positive samples, `neutral` - the 
            proportion of neutral samples, and `negative` - the proportion of
            negative samples.
  '''
  collection_stats = {**get_text_sentiment_distribution(collection)}
  collection_stats['Size'] = len(collection)
  collection_stats['Name'] = collection.name
  collection_stats['Version'] = version
  for key, value in collection_stats.items():
    collection_stats[key] = [value] 
  return pd.DataFrame(collection_stats)


def get_metric_results(collection: TargetTextCollection) -> pd.DataFrame:
  '''
  :param collection: Dataset that contains all of the results.
  :returns: A pandas dataframe with the following columns: `['prediction key', 
            'run number', 'Accuracy', 'data-trained-on', 'Inter-Aspect', 'CWR', 
            'Position', 'Model', 'Macro F1', 'Dataset']`
  '''
  predicted_key = list(collection.metadata['predicted_target_sentiment_key'].keys())
  acc_df = metric_df(collection, accuracy, 'target_sentiments', predicted_key,
                     array_scores=True, assert_number_labels=3, 
                     metric_name='Accuracy', average=False, include_run_number=True)
  acc_df = add_metadata_to_df(acc_df, collection, 'predicted_target_sentiment_key')
  f1_df = metric_df(collection, macro_f1, 'target_sentiments', predicted_key,
                    array_scores=True, assert_number_labels=3, 
                    metric_name='Macro F1', average=False, include_run_number=True)
  combined_df = combine_metrics(acc_df, f1_df, 'Macro F1')
  combined_df['Dataset'] = [collection.name] * combined_df.shape[0]
  combined_df['Data Split'] = [collection.metadata['split']] * combined_df.shape[0]
  return combined_df

def mean_std(data: pd.Series) -> str:
   to_percentage = data * 100
   return f'{np.mean(to_percentage):.2f} ({np.std(to_percentage):.2f})'

# Baseline non-target results
In this notebook we compare *CNN (single)* and *CNN (avg)* to see which version performs better. To do so we first need to download the data from the [relevant directory within the GitHub repository](https://github.com/apmoore1/tdsa_comparisons/tree/master/saved_results/non_target_baselines); the code to do this is below. While loading the data, the code also calculates the relevant metric scores (accuracy and macro F1), finds all the relevant metadata, and creates the dataset statistics for the different training datasets. All of this is then loaded into two pandas DataFrames: one for the metrics on the validation and test splits, and another for the dataset statistics of the single and average training splits:

In [3]:
# Get the data

result_base_url = Path('raw.githubusercontent.com/apmoore1/tdsa_'
                       'comparisons/master/saved_results/non_target_baselines/')
cnn_versions = ['single', 'average']
data_splits = ['test', 'val']
dataset_names = ['election', 'laptop', 'restaurant']

all_results: List[pd.DataFrame] = []

for cnn_version in cnn_versions:
  for data_split in data_splits:
    for dataset_name in dataset_names:
      data_url = Path(result_base_url, cnn_version, f'{dataset_name}_dataset', 
                      f'{data_split}.json')
      data_url = f'https://{str(data_url)}'
      with TemporaryDirectory() as temp_dir:
        temp_file = Path(temp_dir, 'temp_file')
        response = requests.get(data_url, stream=True)
        with temp_file.open('wb+') as fp:
          for chunk in response.iter_content(chunk_size=128):
            fp.write(chunk)
        data_collection = TargetTextCollection.load_json(temp_file)
        all_results.append(get_metric_results(data_collection))

results_df = pd.concat(all_results, ignore_index=True)
test_result_df = results_df[results_df['Data Split']=='Test']
val_result_df = results_df[results_df['Data Split']=='Validation']

training_dataset_stats: List[pd.DataFrame] = []
for cnn_version in cnn_versions:
  for dataset_name in dataset_names:
    data_url = Path(result_base_url, cnn_version, f'{dataset_name}_dataset', 
                    'train.json')
    data_url = f'https://{str(data_url)}'
    with TemporaryDirectory() as temp_dir:
      temp_file = Path(temp_dir, 'temp_file')
      response = requests.get(data_url, stream=True)
      with temp_file.open('wb+') as fp:
        for chunk in response.iter_content(chunk_size=128):
          fp.write(chunk)
      data_collection = TargetTextCollection.load_json(temp_file)
      data_collection.name = dataset_name.capitalize()
      training_dataset_stats.append(get_dataset_stats(data_collection, 
                                    version=cnn_version))
training_dataset_stats = pd.concat(training_dataset_stats, ignore_index=True, 
                                   sort=False)


As stated above, the data is loaded into a pandas DataFrame. For convenience we split the entire dataframe into two: one for the test split results, `test_result_df`, and the other for the validation results, `val_result_df`. The top 5 rows of the validation results are shown below:

In [4]:
val_result_df.head()

index,prediction key,run number,Accuracy,CWR,Inter-Aspect,Model,Position,data-trained-on,Macro F1,Dataset,Data Split
24,predicted_target_sentiment_single_GloVe,0,0.540636,False,False,CNN,False,single,0.422622,Election,Validation
25,predicted_target_sentiment_single_GloVe,1,0.548488,False,False,CNN,False,single,0.392064,Election,Validation
26,predicted_target_sentiment_single_GloVe,2,0.542599,False,False,CNN,False,single,0.409426,Election,Validation
27,predicted_target_sentiment_single_GloVe,3,0.547703,False,False,CNN,False,single,0.370552,Election,Validation
28,predicted_target_sentiment_single_GloVe,4,0.549274,False,False,CNN,False,single,0.398766,Election,Validation


Below are the training dataset statistics for the two CNN versions. The proportions of samples in each sentiment class are very similar between the two. The major difference is the absolute sample size: average is always larger, as expected, but in the Election case it is roughly 89% larger (2,319 compared to 1,227 samples).

In [5]:
pd.pivot_table(training_dataset_stats, index='Name', 
               values=['negative', 'neutral', 'positive', 'Size'], 
               columns='Version').T

Statistic,Version,Election,Laptop,Restaurant
Size,average,2319.0,1051.0,1378.0
Size,single,1227.0,933.0,1162.0
negative,average,49.0,43.0,24.0
negative,single,52.0,44.0,21.0
neutral,average,37.0,17.0,16.0
neutral,single,37.0,14.0,14.0
positive,average,15.0,40.0,60.0
positive,single,11.0,42.0,64.0


The results dataframe shown earlier (`val_result_df`) contains all the relevant data along with some columns that are not needed. The list below explains the relevant columns:
1. `data-trained-on` -- states which version of the CNN it is. This is either `single`, meaning the CNN model was trained only on sentences that contain one unique sentiment, or `average`, meaning the model was trained on all sentences and the sentiment for each sentence was the most frequent target sentiment in that sentence.
2. `run number` -- identifies which training run a result comes from. To account for the random seed problem each model was trained and tested 8 times, thus `run number` ranges from 0-7.
3. `Dataset` -- the dataset that the model was trained and tested on. This can only be `Election`, `Restaurant`, or `Laptop`.
4. `Data Split` -- which part of the dataset the results are associated with. This can only be `Test` or `Validation`.
5. `prediction key` -- an identifier for each model configuration. A single trained model is uniquely identified by combining the `prediction key` and the `run number` columns.
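
To make the difference between the `single` and `average` versions concrete, the sketch below derives a sentence-level label from a sentence's target sentiments. The function `sentence_label` is hypothetical and is not taken from the repository; in particular, how ties between equally frequent sentiments are broken in the real data is not stated here, so the tie-breaking below is an assumption.

```python
from collections import Counter
from typing import List, Optional


def sentence_label(target_sentiments: List[str], version: str) -> Optional[str]:
    '''
    Hypothetical helper. Returns the sentence-level sentiment label, or None
    when the sentence would be left out of the training data.
    '''
    if not target_sentiments:
        return None
    if version == 'single':
        # Keep the sentence only if all targets share one unique sentiment.
        if len(set(target_sentiments)) == 1:
            return target_sentiments[0]
        return None
    if version == 'average':
        # Most frequent target sentiment. Ties go to the sentiment seen
        # first, which is an assumption made for this sketch.
        return Counter(target_sentiments).most_common(1)[0][0]
    raise ValueError(f'Unknown version: {version}')


mixed_sentence = ['negative', 'positive', 'negative']
print(sentence_label(mixed_sentence, 'single'))   # sentence is dropped
print(sentence_label(mixed_sentence, 'average'))  # majority sentiment is used
```

Under this reading, the `average` version keeps every sentence but some of its labels are noisy, while the `single` version has clean labels but fewer sentences.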

Now that we know the data better we can generate the results for both the *Test* and *Validation* data splits. The results cover both the *Accuracy* and *Macro F1* metrics across all 3 datasets, and compare *CNN (single)* to *CNN (average)*. Lastly, as each version of the model was run **8** times to take the random seed problem into account, the results show the mean score with the standard deviation in brackets.
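
The "mean (standard deviation)" strings come from the `mean_std` helper defined at the top of the notebook. The sketch below repeats that helper on made-up scores to show the format. The scores are illustrative only and are not results from this notebook. One detail worth knowing is that `np.std` defaults to the population standard deviation (`ddof=0`), which is slightly smaller than the sample standard deviation (`ddof=1`).

```python
import numpy as np


def mean_std(data) -> str:
    # Same logic as the notebook's helper: convert to percentages, then
    # report the mean with the population standard deviation in brackets.
    to_percentage = np.asarray(data) * 100
    return f'{np.mean(to_percentage):.2f} ({np.std(to_percentage):.2f})'


# Made-up accuracy scores for three runs, for illustration only.
scores = [0.54, 0.55, 0.56]
summary = mean_std(scores)
print(summary)

# The sample standard deviation (ddof=1) of the same scores, for comparison.
sample_std = np.std(np.asarray(scores) * 100, ddof=1)
print(f'{sample_std:.2f}')
```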

Validation:

In [6]:
val_metric_results = pd.pivot_table(val_result_df, index='data-trained-on', 
                                    values=['Accuracy', 'Macro F1'], 
                                    columns='Dataset', 
                                    aggfunc={'Accuracy': mean_std, 
                                             'Macro F1': mean_std})
val_metric_results.T

Metric,Dataset,average,single
Accuracy,Election,54.07 (0.56),54.54 (0.43)
Accuracy,Laptop,70.65 (0.68),69.46 (0.72)
Accuracy,Restaurant,72.31 (0.69),71.98 (0.41)
Macro F1,Election,42.74 (2.09),39.62 (1.75)
Macro F1,Laptop,66.32 (0.96),63.33 (1.70)
Macro F1,Restaurant,60.51 (1.20),58.74 (1.44)


Test:

In [7]:
test_metric_results = pd.pivot_table(test_result_df, index='data-trained-on', 
                                     values=['Accuracy', 'Macro F1'], 
                                     columns='Dataset', 
                                     aggfunc={'Accuracy': mean_std, 
                                              'Macro F1': mean_std})
test_metric_results.T

Metric,Dataset,average,single
Accuracy,Election,52.35 (0.69),54.29 (0.73)
Accuracy,Laptop,68.26 (0.69),65.99 (0.80)
Accuracy,Restaurant,75.81 (0.55),75.19 (0.94)
Macro F1,Election,39.98 (2.20),39.73 (1.88)
Macro F1,Laptop,60.43 (1.36),55.36 (2.00)
Macro F1,Restaurant,59.40 (1.52),56.71 (1.63)


We can see that the results are consistent across the data splits (Test and Validation), as both show the same ordering: *CNN (avg)* is better than *CNN (single)* for everything except Accuracy on the Election dataset. This suggests that in general it is better to use more of the data even if the sentiment label is noisy.

However, the difference between *CNN (single)* and *CNN (avg)* can be quite small. This is shown more clearly below, where the tables give the difference between *CNN (avg)* and *CNN (single)*.

Validation score differences:

In [8]:
def metric_differences(data_split_df: pd.DataFrame, metric_names: List[str]) -> pd.DataFrame:
  '''
  Returns the difference between the Average and Single version

  :param data_split_df: Dataframe that contains the following columns 
                        `data-trained-on` and all the strings within the 
                        `metric_names` argument.
  :param metric_names: Names of columns that contain metric scores that are 
                       to be compared.
  :returns: A dataframe that contains new `Difference` columns.
  '''
  temp_df = data_split_df.copy(deep=True)
  temp_df = temp_df.set_index(['Dataset', 'run number'])
  average_df = temp_df[temp_df['data-trained-on']=='average']
  single_df = temp_df[temp_df['data-trained-on']=='single']
  for metric_name in metric_names:
    diff_df = average_df[metric_name] - single_df[metric_name]
    temp_df[f'Difference {metric_name}'] = diff_df
  temp_df = temp_df.reset_index()
  temp_df = temp_df[temp_df['data-trained-on']=='average']
  return temp_df

validation_diff_df = metric_differences(val_result_df, ['Accuracy', 'Macro F1'])
pd.pivot_table(validation_diff_df, index='Dataset',
               values=['Difference Accuracy', 'Difference Macro F1'], 
               aggfunc={'Difference Accuracy': mean_std, 
                        'Difference Macro F1': mean_std})

Dataset,Difference Accuracy,Difference Macro F1
Election,-0.48 (0.67),3.12 (2.58)
Laptop,1.19 (1.14),2.99 (2.18)
Restaurant,0.34 (0.74),1.77 (1.69)


Test score differences:

In [9]:
test_diff_df = metric_differences(test_result_df, ['Accuracy', 'Macro F1'])
pd.pivot_table(test_diff_df, index='Dataset',
               values=['Difference Accuracy', 'Difference Macro F1'], 
               aggfunc={'Difference Accuracy': mean_std, 
                        'Difference Macro F1': mean_std})

Dataset,Difference Accuracy,Difference Macro F1
Election,-1.94 (1.36),0.24 (3.00)
Laptop,2.27 (1.00),5.07 (2.76)
Restaurant,0.62 (1.31),2.69 (2.82)


As we can see, the differences between *CNN (avg)* and *CNN (single)* are very small, especially when the standard deviation is taken into account. The dataset on which *CNN (avg)* performs notably better is the Laptop dataset. This is most likely because the Laptop dataset contains many more sentences that have only one unique sentiment (*DS1*).

Below we test whether *CNN (avg)* is statistically significantly better than *CNN (single)* using a one-tailed test for each of the metrics. For the accuracy metric we use Welch's t-test, as we can assume normality. For macro F1 we cannot assume normality, so we use the Wilcoxon signed-rank test ([Dror and Reichart 2018](https://arxiv.org/pdf/1809.01448.pdf)). Furthermore, we only want to know whether *CNN (avg)* is statistically significantly better than *CNN (single)* on the datasets where *CNN (avg)* scores higher, which is all the datasets for the macro F1 metric and all but Election for the accuracy metric.
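
For readers unfamiliar with the two tests, here is a minimal sketch of how one-tailed versions of each can be run with SciPy. It is not necessarily how `one_tailed_p_value` is implemented internally, and the scores are made up for illustration. It assumes SciPy 1.6 or newer, where `ttest_ind` accepts the `alternative` argument.

```python
from scipy import stats

# Made-up scores for 8 runs of two models, for illustration only.
better = [0.60, 0.62, 0.61, 0.63, 0.64, 0.66, 0.65, 0.67]
compare = [0.59, 0.60, 0.58, 0.59, 0.59, 0.60, 0.58, 0.59]

# Welch's t-test (unequal variances), one-tailed: is `better` greater?
# This treats the two sets of runs as independent samples.
welch = stats.ttest_ind(better, compare, equal_var=False,
                        alternative='greater')

# Wilcoxon signed-rank test, one-tailed. This is a paired test, so the
# i-th score in `better` is compared with the i-th score in `compare`.
wilcoxon = stats.wilcoxon(better, compare, alternative='greater')

print(f'Welch p-value: {welch.pvalue:.6f}')
print(f'Wilcoxon p-value: {wilcoxon.pvalue:.6f}')
```

Because the Wilcoxon signed-rank test is paired, the order of the runs matters for it, whereas Welch's t-test only looks at the two groups as a whole.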

Validation and Test:

In [0]:
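def metric_p_values(data_split_df: pd.DataFrame, better_split: str, 
                    compare_split: str, datasets: List[str], 
                    metric_names: List[str], assume_metric_normal: List[bool]
                    ) -> Dict[str, Dict[str, float]]:
  '''
  Returns the one-tailed p-values for the hypothesis that the models trained 
  on `better_split` score higher than those trained on `compare_split`.

  :param data_split_df: Dataframe that contains the following columns 
                        `data-trained-on`, `Dataset`, and all the strings 
                        within the `metric_names` argument.
  :param better_split: Value of `data-trained-on` for the models that are 
                       hypothesised to be better e.g. `average`.
  :param compare_split: Value of `data-trained-on` for the models to compare 
                        against e.g. `single`.
  :param datasets: Names of the datasets to compute p-values for.
  :param metric_names: Names of columns that contain metric scores that are 
                       to be compared.
  :param assume_metric_normal: Whether each metric can be assumed to be 
                               normally distributed, in the same order as 
                               `metric_names`.
  :returns: A dictionary mapping dataset name to metric name to p-value.
  '''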
  temp_df = data_split_df.copy(deep=True)
  better_df = temp_df[temp_df['data-trained-on']==f'{better_split}']
  compare_df = temp_df[temp_df['data-trained-on']==f'{compare_split}']
  metric_dataset_p_value = defaultdict(dict)
  for dataset in datasets:
    better_dataset_df = better_df[better_df['Dataset']==dataset]
    compare_dataset_df = compare_df[compare_df['Dataset']==dataset]
    for metric_index, metric_name in enumerate(metric_names):
      assume_normal = assume_metric_normal[metric_index]
      better_scores = better_dataset_df[f'{metric_name}']
      compare_scores = compare_dataset_df[f'{metric_name}']
      p_value = one_tailed_p_value(better_scores, compare_scores, 
                                   assume_normal=assume_normal)
      metric_dataset_p_value[dataset][metric_name] = p_value
  return metric_dataset_p_value

In [24]:
metric_names = ['Accuracy', 'Macro F1']
all_dataset_names = ['Election', 'Laptop', 'Restaurant']

validation_p_values = pd.DataFrame(metric_p_values(val_result_df, 'average', 
                                   'single', all_dataset_names, metric_names, 
                                   [True, False]))
test_p_values = pd.DataFrame(metric_p_values(test_result_df, 'average', 
                             'single', all_dataset_names, metric_names, 
                             [True, False]))
combined_test_p_values = test_p_values.reset_index()
combined_test_p_values['Split'] = ['Test'] * combined_test_p_values.shape[0]
combined_validation_p_values = validation_p_values.reset_index()
combined_validation_p_values['Split'] = ['Validation'] * combined_validation_p_values.shape[0]
combined_p_values = pd.concat([combined_test_p_values, combined_validation_p_values], 
                              ignore_index=True)
combined_p_values = combined_p_values.rename(columns={"index": "Metric"})
combined_p_values = pd.melt(combined_p_values, id_vars=['Metric', 'Split'], 
                            value_vars=['Election', 'Laptop', 'Restaurant'], 
                            var_name='Dataset', value_name='P-Value')
pd.pivot_table(combined_p_values, index=['Split', 'Metric'], 
               columns='Dataset', values='P-Value')



Split,Metric,Election,Laptop,Restaurant
Test,Accuracy,0.999921,2.9e-05,0.077711
Test,Macro F1,0.444319,0.005859,0.024975
Validation,Accuracy,0.950955,0.003293,0.144942
Validation,Macro F1,0.005859,0.005859,0.017846


We can see above that, at a significance level of $\alpha = 0.05$, *CNN (avg)* is statistically significantly better than *CNN (single)* on the Laptop dataset for both metrics and on the Restaurant dataset for the Macro F1 metric, on both splits. On the Election dataset the difference is significant only for Macro F1 on the validation split. Furthermore, if we use the Bonferroni correction to take into account that multiple p-values are compared for each metric, we find the following:

In [25]:
from target_extraction.analysis.statistical_analysis import find_k_estimator

print('Validation Split')
for metric_name in metric_names:
  sig_datasets = find_k_estimator(validation_p_values.loc[f'{metric_name}', :].values, 
                                  alpha=0.05, method='B')
  print(f'For metric {metric_name} there are {sig_datasets} where CNN (avg) is '
        'statistically significantly better than CNN (single) with a '
        'confidence >=95%')
  

print('Test Split')
for metric_name in metric_names:
  sig_datasets = find_k_estimator(test_p_values.loc[f'{metric_name}', :].values, 
                                  alpha=0.05, method='B')
  print(f'For metric {metric_name} there are {sig_datasets} where CNN (avg) is '
        'statistically significantly better than CNN (single) with a '
        'confidence >=95%')

Validation Split
For metric Accuracy there are 1 where CNN (avg) is statistically significantly better than CNN (single) with a confidence >=95%
For metric Macro F1 there are 3 where CNN (avg) is statistically significantly better than CNN (single) with a confidence >=95%
Test Split
For metric Accuracy there are 1 where CNN (avg) is statistically significantly better than CNN (single) with a confidence >=95%
For metric Macro F1 there are 2 where CNN (avg) is statistically significantly better than CNN (single) with a confidence >=95%
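
The counts above can be reproduced by hand under one reading of `method='B'`: a Bonferroni-style partial conjunction count, where the u-th smallest of N p-values is multiplied by (N - u + 1) and the count is the largest u whose adjusted value stays at or below $\alpha$. This is an assumption about what `find_k_estimator` computes, not a copy of its implementation, and `bonferroni_k_estimate` is a hypothetical name. The p-values are copied from the table above.

```python
from typing import List


def bonferroni_k_estimate(p_values: List[float], alpha: float = 0.05) -> int:
    '''
    Hypothetical sketch: counts the datasets with a significant effect using
    Bonferroni-adjusted partial conjunction p-values.
    '''
    sorted_p = sorted(p_values)
    n = len(sorted_p)
    k = 0
    running_max = 0.0
    for u, p_value in enumerate(sorted_p, start=1):
        # Adjust the u-th smallest p-value for the (n - u + 1) hypotheses
        # that remain, and keep the adjusted values non-decreasing.
        adjusted = min(1.0, (n - u + 1) * p_value)
        running_max = max(running_max, adjusted)
        if running_max <= alpha:
            k = u
    return k


# P-values in the order Election, Laptop, Restaurant, from the table above.
p_value_table = {
    ('Validation', 'Accuracy'): [0.950955, 0.003293, 0.144942],
    ('Validation', 'Macro F1'): [0.005859, 0.005859, 0.017846],
    ('Test', 'Accuracy'): [0.999921, 2.9e-05, 0.077711],
    ('Test', 'Macro F1'): [0.444319, 0.005859, 0.024975],
}
k_estimates = {key: bonferroni_k_estimate(values)
               for key, values in p_value_table.items()}
for (split, metric), k in k_estimates.items():
    print(f'{split} {metric}: {k}')
```

Note that this count differs from testing each p-value against $\alpha / N$: the Restaurant validation Macro F1 p-value (0.017846) is above $0.05 / 3$, yet it is still counted here because only one hypothesis remains at that step.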


To conclude, it is better to use *CNN (avg)* as a strong baseline across the datasets. Furthermore, the advantage of *CNN (avg)* is larger on the Macro F1 metric than on Accuracy, which is most likely due to the very few samples the *single* dataset has for the minority class.