#### This notebook explores different ways of drawing samples that correspond to "dataset 2". Initially, "dataset 2" is a dataset with which we obtained a good prediction model for augmentation classes, but we need to verify whether we just "got lucky" or whether re-drawing this dataset many times leads to similar results.

#### Here, we draw samples from a larger dataset, "dataset 3", in which each query Qi is randomly paired with multiple candidate datasets that originate from different datasets.

In [1]:
import pandas as pd

dataset_3 = pd.read_csv('training-simplified-data-generation-many-candidates-per-query_with_median_and_mean_based_classes.csv')
original_dataset_2 = pd.read_csv('training-simplified-data-generation_with_median_and_mean_based_classes.csv')


In [2]:
## get the numbers of positive and negative gains for both datasets

print('negative in dataset_3', dataset_3.loc[dataset_3['gain_in_r2_score'] <= 0].shape[0], 'positive in dataset_3', dataset_3.loc[dataset_3['gain_in_r2_score'] > 0].shape[0])
print('negative in original_dataset_2', original_dataset_2.loc[original_dataset_2['gain_in_r2_score'] <= 0].shape[0], 'positive in original_dataset_2', original_dataset_2.loc[original_dataset_2['gain_in_r2_score'] > 0].shape[0])

('negative in dataset_3', 1097156, 'positive in dataset_3', 1019700)
('negative in original_dataset_2', 4177, 'positive in original_dataset_2', 5707)
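
The counts above are easier to compare across datasets of very different sizes when expressed as proportions. A minimal sketch, assuming only the `gain_in_r2_score` column used above; the helper name `gain_sign_proportions` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def gain_sign_proportions(df, column='gain_in_r2_score'):
    """Return the fraction of rows with non-positive and with positive gain."""
    negative = (df[column] <= 0).mean()  # same condition as the 'negative' count
    positive = (df[column] > 0).mean()   # same condition as the 'positive' count
    return {'negative': negative, 'positive': positive}

# Toy data standing in for dataset_3 or original_dataset_2 (hypothetical values).
toy = pd.DataFrame({'gain_in_r2_score': [-0.2, 0.0, 0.1, 0.4]})
proportions = gain_sign_proportions(toy)
print(proportions)
```

Calling `gain_sign_proportions(dataset_3)` and `gain_sign_proportions(original_dataset_2)` would give the two balances side by side.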


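#### Both datasets look relatively balanced: dataset 3 is about 51.8% negative and 48.2% positive, while the original dataset 2 is about 42.3% negative and 57.7% positive. Let's draw other "versions" of dataset 2 from dataset 3: for each pair \<Qi, Cj\> with gain_marker = 'positive', draw one pair with gain_marker = 'negative' that has the same query id.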

In [12]:
import numpy as np

def create_version_of_dataset_2(larger_dataset, n_queries, one_candidate_per_query=True):
    """Draw candidates from larger_dataset for n_queries of its queries.

    If one_candidate_per_query is True, draw one candidate per query uniformly at
    random, so its gain_marker can be either 'positive' or 'negative'. Otherwise,
    draw two candidates per query: one with gain_marker == 'positive' and one with
    gain_marker == 'negative'. Queries that lack either class are skipped.
    """
    # np.random.choice samples with replacement by default, so the same query
    # can be drawn more than once; pass replace=False to force distinct queries.
    queries = np.random.choice(larger_dataset['query'].unique(), n_queries)
    subdatasets = []
    for q in queries:
        subtable = larger_dataset.loc[larger_dataset['query'] == q]
        if one_candidate_per_query:
            sample = subtable.sample(n=1)
        else:
            positives = subtable.loc[subtable['gain_marker'] == 'positive']
            negatives = subtable.loc[subtable['gain_marker'] == 'negative']
            if positives.empty or negatives.empty:
                continue  # no positive/negative pair exists for this query
            sample = pd.concat([positives.sample(n=1), negatives.sample(n=1)])
        subdatasets.append(sample)
    return pd.concat(subdatasets)
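
Filtering the full table once per query can be slow on a table the size of dataset 3. The sketch below shows one possible alternative for the two-candidates-per-query case: shuffle once, then keep the first row of each (query, gain_marker) group. It assumes the `query` and `gain_marker` columns used above; the function name `draw_pairs_per_query`, the `seed` parameter, and the toy data are hypothetical additions, and unlike the function above it draws distinct queries.

```python
import numpy as np
import pandas as pd

def draw_pairs_per_query(larger_dataset, n_queries, seed=None):
    """Draw one 'positive' and one 'negative' row for n_queries distinct queries."""
    rng = np.random.RandomState(seed)
    # Keep only rows labeled with one of the two classes of interest.
    labeled = larger_dataset[larger_dataset['gain_marker'].isin(['positive', 'negative'])]
    # A query is eligible only if it has at least one row of each class.
    classes_per_query = labeled.groupby('query')['gain_marker'].nunique()
    eligible = classes_per_query[classes_per_query == 2].index.to_numpy()
    # Raises ValueError if n_queries exceeds the number of eligible queries.
    chosen = rng.choice(eligible, size=n_queries, replace=False)
    subset = labeled[labeled['query'].isin(chosen)]
    # After a uniform shuffle, the first row of each group is a uniform draw from it.
    shuffled = subset.sample(frac=1, random_state=rng)
    return shuffled.groupby(['query', 'gain_marker']).head(1)

# Toy data (hypothetical): q3 has no 'negative' row, so it is never eligible.
toy = pd.DataFrame({
    'query': ['q1', 'q1', 'q1', 'q2', 'q2', 'q3'],
    'candidate': ['c1', 'c2', 'c3', 'c4', 'c5', 'c6'],
    'gain_marker': ['positive', 'negative', 'negative',
                    'positive', 'negative', 'positive'],
})
version = draw_pairs_per_query(toy, n_queries=2, seed=0)
print(version)
```

Passing a fixed `seed` makes a given version reproducible, which helps when the same re-draw has to be compared across experiments.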

In [None]:
## Draw versions of dataset 2 with one candidate per query (its gain_marker is either
## 'positive' or 'negative'). Versions with two candidates per query (one 'positive' and
## one 'negative') are drawn the same way by passing one_candidate_per_query=False.

NUMBER_OF_VERSIONS_WITH_ONE_CANDIDATE_PER_QUERY = 10
NUMBER_OF_VERSIONS_WITH_TWO_CANDIDATES_PER_QUERY = 10
NUMBER_OF_QUERIES = len(set(original_dataset_2['query']))

# Collect every drawn version instead of overwriting the previous one.
versions_ocpq = []  # ocpq = one candidate per query
for _ in range(NUMBER_OF_VERSIONS_WITH_ONE_CANDIDATE_PER_QUERY):
    versions_ocpq.append(create_version_of_dataset_2(dataset_3, NUMBER_OF_QUERIES))

dataset = versions_ocpq[0]
dataset.head()
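
Once several versions have been drawn, one simple sanity check is to compare their class balance before training anything on them. This is only a sketch under stated assumptions: it uses the `gain_marker` column from above, the helper `positive_share_per_version` and the toy versions are hypothetical, and it does not replace re-evaluating the prediction model on each version.

```python
import pandas as pd

def positive_share_per_version(versions, column='gain_marker'):
    """Return the fraction of 'positive' rows in each drawn version."""
    return pd.Series([(v[column] == 'positive').mean() for v in versions])

# Toy versions standing in for the drawn datasets (hypothetical values).
toy_versions = [
    pd.DataFrame({'gain_marker': ['positive', 'negative', 'positive', 'negative']}),
    pd.DataFrame({'gain_marker': ['positive', 'positive', 'positive', 'negative']}),
]
shares = positive_share_per_version(toy_versions)
print(shares.describe())
```

A large spread in these shares would suggest that the re-drawn versions differ from each other before any model is involved.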