## Build Dataset

This notebook builds the datasets from an integrated `.jsonl` file.


### Steps
1. Partition
2. Align
3. Merge
4. Convert to Dataset

In [1]:
import os
from partition import partition
from align import align
from merge import merge_datasets
from convert_to_datasets import create_dataset




In [2]:
def create_path(path: str):
    """Create the directory at `path` if it does not already exist."""
    if not os.path.exists(path):
        os.makedirs(path)
        print(f"Created the path: {path}")


In [3]:
input_base_path = '/shared/3/projects/hiatus/rotten_tomatoes'
# hrs_release_08 = [
#     'hrs1_08-14-23_background_bgg_350_anonymized.jsonl',
#     'hrs1_08-14-23_background_globalvoices_anonymized.jsonl',
#     'hrs1_08-14-23_background_instructables_anonymized.jsonl',
#     'hrs1_08-14-23_background_stackexchange_literature_anonymized.jsonl',
#     'hrs1_08-14-23_background_stackexchange_stem_anonymized.jsonl',
#     'hrs1_08-14-23_boardgamegeek_foreground_anonymized.jsonl',
#     'hrs1_08-14-23_globalvoices_foreground_anonymized.jsonl',
#     'hrs1_08-14-23_instructables_foreground_anonymized.jsonl',
#     'hrs1_08-14-23_stackexchangehumanities_foreground_anonymized.jsonl',
#     'hrs1_08-14-23_stackexchangestem_foreground_anonymized.jsonl']
# hrs_release_08_names = [
#     'background_bgg_350',
#     'background_globalvoices',
#     'background_instructables',
#     'background_stackexchange_literature',
#     'background_stackexchange_stem',
#     'boardgamegeek_foreground',
#     'globalvoices_foreground',
#     'instructables_foreground',
#     'stackexchangehumanities_foreground',
#     'stackexchangestem_foreground'
# ]

hrs_file = "rtcorpus.jsonl"
output_name = "rtcorpus"

### Set dirs.
input_file = os.path.join(input_base_path, 'raw_data', hrs_file)
output_path = os.path.join(input_base_path, 'generated_data', output_name)
create_path(output_path)
print("Input file path:", input_file)
print("Output file path:", output_path)


Created the path: /shared/3/projects/hiatus/rotten_tomatoes/generated_data/rtcorpus
Input file path: /shared/3/projects/hiatus/rotten_tomatoes/raw_data/rtcorpus.jsonl
Output file path: /shared/3/projects/hiatus/rotten_tomatoes/generated_data/rtcorpus


# Step 1: Partition

Generate the query and candidate files for each split (train, dev, test) from the source `.jsonl` file.

```
Input: 'hrs1_08-14-23_background_bgg_350_anonymized.jsonl'
Output: 
        `dev_candidates.jsonl`
        `test_candidates.jsonl`
        `train_candidates.jsonl`
        `dev_queries.jsonl`
        `test_queries.jsonl`
        `train_queries.jsonl`
```
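The internals of `partition` are not shown in this notebook. As a rough illustration of what splitting one source file into queries and candidates could look like, the sketch below groups documents by author and sends about half of each author's documents to each side. The helper `split_queries_candidates` and the field names `authorID` and `documentID` are assumptions made for this example, not the actual implementation.

```python
import random
from collections import defaultdict


def split_queries_candidates(records, seed=0):
    """Split each author's documents into a query part and a candidate part.

    Hypothetical helper: `authorID` and `documentID` are assumed field names.
    Authors with fewer than two documents are skipped, because they cannot
    appear on both sides.
    """
    rng = random.Random(seed)

    # Group documents by author.
    by_author = defaultdict(list)
    for rec in records:
        by_author[rec["authorID"]].append(rec)

    queries, candidates = [], []
    for docs in by_author.values():
        if len(docs) < 2:
            continue
        rng.shuffle(docs)
        half = len(docs) // 2
        queries.extend(docs[:half])
        candidates.extend(docs[half:])
    return queries, candidates


# Toy records standing in for lines of the source .jsonl file.
records = [
    {"authorID": "a1", "documentID": "d1"},
    {"authorID": "a1", "documentID": "d2"},
    {"authorID": "a1", "documentID": "d3"},
    {"authorID": "a2", "documentID": "d4"},
    {"authorID": "a2", "documentID": "d5"},
    {"authorID": "a3", "documentID": "d6"},  # single document, skipped
]
queries, candidates = split_queries_candidates(records)
print(len(queries), "query documents,", len(candidates), "candidate documents")
```

Fixing the seed keeps the split reproducible across runs, which matters if the generated files are compared between experiments.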

In [4]:
nrows = None  # None to use the whole file
partition(input_file, output_path, nrows)


INFO:root:Loading meta data...
INFO:root:40 dev samples
INFO:root:80 test samples
INFO:root:678 training samples


Sampling text pairs...
saving text pair samples


25175it [00:00, 25812.67it/s]
25175it [00:00, 33540.05it/s]
25175it [00:00, 33448.23it/s]


# Step 2: Align

Align the authors in the candidate and query files, and assert that there are no overlapping documents between the two files.
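The two conditions described here can be checked with plain set operations. The sketch below is only an illustration of that check, not the code inside `align`; the function name `check_alignment` and the field names `authorID` and `documentID` are assumptions.

```python
def check_alignment(queries, candidates):
    """Return True if both sides cover the same authors and share no documents.

    Hypothetical check: `authorID` and `documentID` are assumed field names.
    """
    query_authors = {rec["authorID"] for rec in queries}
    candidate_authors = {rec["authorID"] for rec in candidates}
    query_docs = {rec["documentID"] for rec in queries}
    candidate_docs = {rec["documentID"] for rec in candidates}

    same_authors = query_authors == candidate_authors
    no_overlap = query_docs.isdisjoint(candidate_docs)
    return same_authors and no_overlap


aligned_queries = [
    {"authorID": "a1", "documentID": "d1"},
    {"authorID": "a2", "documentID": "d3"},
]
aligned_candidates = [
    {"authorID": "a1", "documentID": "d2"},
    {"authorID": "a2", "documentID": "d4"},
]
# Same authors, but document d1 appears on both sides.
overlapping_candidates = [
    {"authorID": "a1", "documentID": "d1"},
    {"authorID": "a2", "documentID": "d4"},
]

aligned_ok = check_alignment(aligned_queries, aligned_candidates)
overlap_ok = check_alignment(aligned_queries, overlapping_candidates)
print(aligned_ok, overlap_ok)
```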

In [5]:
print("Aligning dev dataset!")
align(os.path.join(output_path, 'dev_candidates.jsonl'), os.path.join(output_path, 'dev_queries.jsonl'))

print("Aligning test dataset!")
align(os.path.join(output_path, 'test_candidates.jsonl'), os.path.join(output_path, 'test_queries.jsonl'))

print("Aligning train dataset!")
align(os.path.join(output_path, 'train_candidates.jsonl'), os.path.join(output_path, 'train_queries.jsonl'))


Aligning dev dataset!
Aligning test dataset!
Aligning train dataset!


# Step 3: Merge

Merge the candidates and queries into a single file for each split, e.g. `train.jsonl`.
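The record layout that `merge_datasets` writes is not shown in this notebook. As a minimal sketch of the idea, assuming one query record and one candidate record per author, the hypothetical helper below pairs them by author into a single record. The field names `authorID`, `text`, `query_text`, and `candidate_text` are invented for the example.

```python
def merge_by_author(queries, candidates):
    """Pair each author's query record with the matching candidate record.

    Hypothetical helper with assumed field names; authors that are missing
    from the candidate side are dropped.
    """
    candidates_by_author = {rec["authorID"]: rec for rec in candidates}

    merged = []
    for query in queries:
        candidate = candidates_by_author.get(query["authorID"])
        if candidate is None:
            continue  # no matching candidate for this author
        merged.append(
            {
                "authorID": query["authorID"],
                "query_text": query["text"],
                "candidate_text": candidate["text"],
            }
        )
    return merged


queries = [
    {"authorID": "a1", "text": "first review by a1"},
    {"authorID": "a2", "text": "first review by a2"},
]
candidates = [
    {"authorID": "a2", "text": "second review by a2"},
    {"authorID": "a1", "text": "second review by a1"},
]
merged = merge_by_author(queries, candidates)
print(merged[0])
```

Indexing the candidates by author first means the order of the two input files does not have to match.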


In [6]:
print("Merging train data")
train_input_paths = [(os.path.join(p, 'train_queries.jsonl'), os.path.join(p, 'train_candidates.jsonl')) for p in [output_path]]
merge_datasets(train_input_paths, os.path.join(output_path, 'train.jsonl'))

print("Merging dev data")
dev_input_paths = [(os.path.join(p, 'dev_queries.jsonl'), os.path.join(p, 'dev_candidates.jsonl')) for p in [output_path]]
merge_datasets(dev_input_paths, os.path.join(output_path, 'dev.jsonl'))

print("Merging test data")
test_input_paths = [(os.path.join(p, 'test_queries.jsonl'), os.path.join(p, 'test_candidates.jsonl')) for p in [output_path]]
merge_datasets(test_input_paths, os.path.join(output_path, 'test.jsonl'))


Merging train data
/shared/3/projects/hiatus/rotten_tomatoes/generated_data/rtcorpus/train_queries.jsonl
/shared/3/projects/hiatus/rotten_tomatoes/generated_data/rtcorpus/train_candidates.jsonl


678it [00:00, 3818.06it/s]


Merging dev data
/shared/3/projects/hiatus/rotten_tomatoes/generated_data/rtcorpus/dev_queries.jsonl
/shared/3/projects/hiatus/rotten_tomatoes/generated_data/rtcorpus/dev_candidates.jsonl


40it [00:00, 3421.97it/s]


Merging test data
/shared/3/projects/hiatus/rotten_tomatoes/generated_data/rtcorpus/test_queries.jsonl
/shared/3/projects/hiatus/rotten_tomatoes/generated_data/rtcorpus/test_candidates.jsonl


80it [00:00, 3898.82it/s]


# Step 4: Convert to Datasets

In [7]:
print('Creating training dataset...')
train_inpath = os.path.join(output_path, 'train.jsonl')
train_outpath = os.path.join(output_path, 'train')
create_dataset(train_inpath, train_outpath)

print('Creating dev dataset...')
dev_inpath = os.path.join(output_path, 'dev.jsonl')
dev_outpath = os.path.join(output_path, 'dev')
create_dataset(dev_inpath, dev_outpath)

print('Creating test dataset...')
test_inpath = os.path.join(output_path, 'test.jsonl')
test_outpath = os.path.join(output_path, 'test')
create_dataset(test_inpath, test_outpath)


Creating training dataset...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/678 [00:00<?, ? examples/s]

Creating dev dataset...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/40 [00:00<?, ? examples/s]

Creating test dataset...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/80 [00:00<?, ? examples/s]
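
As a sanity check after conversion, the number of records in each merged `.jsonl` file can be compared with the sample counts reported in the partition log (678 train, 40 dev, and 80 test in this run). The helper `count_jsonl_records` below is a hypothetical, standard-library-only sketch; the demo writes a small temporary file so that it runs without the project data.

```python
import json
import os
import tempfile


def count_jsonl_records(path):
    """Count the non-empty lines in a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


# Demo on a temporary file with three records.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for i in range(3):
            f.write(json.dumps({"id": i}) + "\n")
    demo_count = count_jsonl_records(demo_path)

print(demo_count)
```

In the notebook, the same function could be called on `os.path.join(output_path, 'train.jsonl')` and the other two splits.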