# Triplet Loss Dataset Formatter

Takes a positive and negative dataset and generates a new set in the following triplet form:

```{anchor} \t {positive} \t {negative}```

Currently focused on reading titles from the mergelog in the Safegraph-merge data. It can be easily adapted to any other positive/negative text sets by changing the column names.

## Setup

In [None]:
import pandas as pd
import numpy as np

In [None]:
def split_titles(data):
    """ Splits up titles from the same row into their own rows. Titles are separated by commas """
    new_data = pd.concat([pd.Series(row['titles'].split(',')) for _, row in data.iterrows()]).reset_index()
    new_data.columns = ['index', 'titles']
    return new_data['titles']
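
On larger inputs, the row-by-row loop in `split_titles` may be slow. One possible alternative, sketched here assuming `titles` holds comma-separated strings, is `Series.str.split` followed by `Series.explode` (pandas 0.25 or later). The name `split_titles_vectorized` is hypothetical, and unlike `split_titles` this variant drops missing titles rather than raising on them.

```python
import pandas as pd

def split_titles_vectorized(data):
    """Hypothetical vectorized variant of split_titles: one title per output row."""
    # Drop missing rows, split each string on commas, then give every title its own row
    titles = data['titles'].dropna().str.split(',').explode()
    return titles.reset_index(drop=True)

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({'titles': ['a,b', 'c', None, 'd,e,f']})
split_demo = split_titles_vectorized(toy)
print(split_demo.tolist())
```

Whether this is actually faster depends on the data, so it is worth timing both versions before switching.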

## Loading Data

Change the datasets in this cell to provide a different positive or negative set.

In [None]:
positive_filename = 'data/safegraph/merge_brands/safegraph_gamestop_raw.csv'
negative_filename = 'data/safegraph/merge_brands/safegraph_full_merge_ndomain5.csv'

pos_data = pd.read_csv(positive_filename)
neg_data = pd.read_csv(negative_filename)

## Generate Triplet Examples

Generate triplet examples from the provided positive and negative sets, then check the formatting of each row.

Titles within the same row of the negative set are split apart, but the same is not done for the positive set. We want to remember the relationship between titles of the same row, and teach that relationship to our model. Thus the `{positive}` and `{anchor}` will be taken from the same row in the positive set.
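
As a small illustration of that idea, the sketch below builds a single triplet from one positive row and one negative title. The row contents, the negative title, and the variable names are made up for the example and are not taken from the real datasets.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Illustrative values, not taken from the real datasets
positive_row = "GameStop,Game Stop,GameStop Inc"
negative_title = "Starbucks"

titles = positive_row.split(',')
# Pick two distinct titles from the same row: the first is the anchor, the second the positive
anchor_idx, positive_idx = rng.choice(len(titles), size=2, replace=False)
triplet = "\t".join([titles[anchor_idx], titles[positive_idx], negative_title])
print(triplet)
```

Sampling without replacement ensures the anchor and the positive are different entries of the same row.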

In [None]:
%%time

# Split negatives but not positives
pos_titles = pos_data['titles']
neg_titles = split_titles(neg_data)

N = 100000              # Number of examples to generate
batch_size = 100        # df.sample is a bottleneck; larger batches mean fewer calls
i = 0
results = []
while i < N:
    pos_row = pos_titles.sample(batch_size)
    negative = neg_titles.sample(batch_size)
    
    for j in range(batch_size):
        try:
            pos_sam = pos_row.iloc[j].split(',')    # Separate titles within the same positive row
            
            # Randomly select an anchor and a positive from the same row, then a single negative
            # Join the three fields with tabs
            if len(pos_sam) >= 2 and i < N:         # Make sure there are at least two titles in this row
                idxs = np.random.choice(len(pos_sam), 2, replace=False)
                results.append("{0}\t{1}\t{2}".format(pos_sam[idxs[0]], pos_sam[idxs[1]], negative.iloc[j]))
                i += 1
        except AttributeError:    # Skip non-string entries such as NaN (maybe investigate this more?)
            continue

In [None]:
num_wrong = 0
for s in results:
    if s.count('\t') != 2:
        num_wrong += 1
print("Number of incorrectly formatted rows: ", num_wrong)

## Save Examples

In [None]:
filename = 'data/training/gamestop_triplet_test.txt'

with open(filename, 'w') as f:
    for item in results:
        f.write("%s\n" % item)
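
To consume the saved file later, one option is to read it back with `pandas.read_csv` using a tab separator. The sketch below writes a small temporary file so that it runs on its own; the column names and sample rows are assumptions for illustration.

```python
import csv
import os
import tempfile

import pandas as pd

# Illustrative triplets written the same way as the cell above
sample_rows = ["a1\ta2\tn1", "b1\tb2\tn2"]
path = os.path.join(tempfile.mkdtemp(), "triplets_demo.txt")
with open(path, "w") as f:
    for item in sample_rows:
        f.write("%s\n" % item)

# Read the tab-separated triplets back into named columns
triplets = pd.read_csv(
    path,
    sep="\t",
    header=None,
    names=["anchor", "positive", "negative"],
    quoting=csv.QUOTE_NONE,  # titles may contain quote characters
)
print(triplets.shape)
```

Passing `quoting=csv.QUOTE_NONE` keeps pandas from treating quote characters inside titles as field delimiters.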