# Datasets
---

In this notebook we build the Dataset classes needed to work with all the datasets we have.
First we introduce the datasets, then we group them by intended use, and finally we implement the classes that manage those datasets and tasks.

# 0.0 Utils
---

We will be using the 🤗*Datasets* library, so first we need to install it.

> !pip install datasets > datasets_installation.txt

We'll also be using the 🤗 *Transformers* library, as we need a tokenizer and a vocabulary.

> !pip install transformers > transformers_installation.txt

We'll also be using `pydash`, a Python library inspired by lodash.

> !pip install pydash > pydash_installation.txt

For logging we'll use Weights and Biases (`wandb`), which we install independently of Hugging Face and then use from within it.

> !pip install wandb -qq

Let's define all the `imports` and `hyperparameters` in one place.

In [1]:
# Imports
import time
import os
import json
from transformers import AutoTokenizer, AutoModel, PreTrainedTokenizer, DataCollatorForLanguageModeling, Trainer, BertForMaskedLM, TrainingArguments
import torch
import transformers
from torch.utils.data import Dataset, DataLoader
from typing import Dict, List, Union
from datasets import load_dataset, DatasetDict
from enum import Enum
# from pydash.arrays import pull_at
import numpy as np
from torch.nn.utils.rnn import pad_sequence
import math

import wandb

In [3]:
# ----------------------------------- #
#           Hyperparameters
# ----------------------------------- #
RUN_NAME = 'scibert-s2orc'
RUN_NUMBER = 2
RUN_ITER = 1


# --------- logging         --------- #
verbose = True # logging function description start and end
debug = False # logging element values
print_debug = False
time_debug = True

transformers.logging.set_verbosity_info()

wandb.login()
# Optional: log both gradients and parameters
%env WANDB_WATCH=all

# --------- preprocessing   --------- #
# in **partial_prepare_data**
remove_None_papers = True # if True, remove papers with None in either abstract or title
remove_Unused_columns = True
# in **preprocess**
clean_None_data = False # if True, changes all the None (abstract or title) to ''
remove_None_data = False # if True (and clean_None_data is False), remove every None abstract/title together with the corresponding title/abstract

# --------- paths           --------- #
# data folder path
data_base_dir = '/home/vivoli/Thesis/data'
s2orc_type = 'full'
N = None

# --------- model/tokenizer --------- #
# Hugging Face model/tokenizer name
MODEL_PATH = 'allenai/scibert_scivocab_uncased'
model_name_or_path = MODEL_PATH

# --------- checkpoint model -------- #
from transformers.trainer_utils import get_last_checkpoint

# seed for reproducibility of experiments
SEED = 1234

[34m[1mwandb[0m: Currently logged in as: [33memanuelevivoli[0m (use `wandb login --relogin` to force relogin)


env: WANDB_WATCH=all


# 0.1 KeyPhrase Dataset
---

These datasets are used for testing.

- Keyphrase paths

```python
dataset_names = ['inspec', 'krapivin', 'nus', 'semeval', 'kp20k', 'duc', 'stackexchange']
json_base_dir = data_base_dir + '/keyphrase/json/'
```

- Keyphrase read json file

```python
def json_keyphrase_read( dataset_name, json_base_dir, file_name=None ):
    """
    Args:
        dataset_name (string): Directory name with the json file.
        json_base_dir (string): Path to the Dataset directory.
        file_name (string): (Optional) Json file name.
        
    Return:
        json_list (list of dict): List of dictionaries, each one with the fields 
            - 'title' (string)
            - 'abstract' (string)
            - 'fulltext' (string | '')
            - 'keywords' (list)
    """
    if verbose: print(dataset_name)

    input_json_path = os.path.join(json_base_dir, dataset_name, 
                                   '%s_test.json' % dataset_name if file_name is None else file_name)

    json_list_of_dict = []
    with open(input_json_path, 'r') as input_json:
        for json_line in input_json:
            json_dict = json.loads(json_line)

            if dataset_name == 'stackexchange':
                json_dict['abstract'] = json_dict['question']
                json_dict['keywords'] = json_dict['tags']            
                del json_dict['question']
                del json_dict['tags']

            keywords = json_dict['keywords']

            if isinstance(keywords, str):
                keywords = keywords.split(';')
                json_dict['keywords'] = keywords

            json_list_of_dict.append(json_dict)
                
    return json_list_of_dict
```
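
To make the per-line normalization in `json_keyphrase_read` concrete, here is a minimal sketch that applies the same two rules to in-memory JSON lines instead of a file. The helper name `normalize_record` and the sample records are made up for illustration; only the field names come from the function above.

```python
import json

def normalize_record(json_line, dataset_name):
    """Hypothetical helper: apply the same per-line rules as json_keyphrase_read."""
    record = json.loads(json_line)

    # StackExchange files use 'question'/'tags' instead of 'abstract'/'keywords'
    if dataset_name == 'stackexchange':
        record['abstract'] = record.pop('question')
        record['keywords'] = record.pop('tags')

    # Keywords stored as one ';'-separated string become a list
    if isinstance(record['keywords'], str):
        record['keywords'] = record['keywords'].split(';')

    return record

# Made-up sample lines, not taken from the real datasets
inspec_line = '{"title": "t", "abstract": "a", "fulltext": "", "keywords": "graph theory;trees"}'
stack_line = '{"title": "t", "question": "q", "tags": "python;numpy"}'

inspec_record = normalize_record(inspec_line, 'inspec')
stack_record = normalize_record(stack_line, 'stackexchange')
print(inspec_record['keywords'])
print(sorted(stack_record))
```

After this step every record exposes the same `abstract` and `keywords` fields, whichever dataset it came from.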

These keyphrase datasets could be useful for testing models on the keyphrase task or on abstract-title summarization/generation/embedding.

For now, we can skip implementing the Dataset and DataLoader classes for these objects.

Still, the dataset and DataLoader would be as simple as follows:

- Keyphrase process data -> tensor

```python
def data_keyphrase_process(json_list_of_dict, tokenizer, debug=False):
    data = []
    for json_dict in json_list_of_dict:
        title_tensor_ = torch.tensor(tokenizer.encode(json_dict['title']),
                                dtype=torch.long)
        if debug: print(title_tensor_)
        abstract_tensor_ = torch.tensor(tokenizer.encode(json_dict['abstract']),
                                dtype=torch.long)
        if debug: print(abstract_tensor_)
        fulltext_tensor_ = torch.tensor(tokenizer.encode(json_dict['fulltext']),
                                dtype=torch.long)
        if debug: print(fulltext_tensor_)
        keywords_tensor_ = torch.tensor(tokenizer(json_dict['keywords'], padding=True)['input_ids'], 
                                dtype=torch.long)
        if debug: print(keywords_tensor_)

        data.append((title_tensor_, abstract_tensor_, fulltext_tensor_, keywords_tensor_))
    return data
```

- Defining model and vocab

```python
# we need the `vocab` and the `tokenizer`; both come with *AutoTokenizer*
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModel.from_pretrained(MODEL_PATH)
```

- Extracting tokenized data
```python
# now we can use them
json_list_of_dict = json_keyphrase_read( dataset_names[0], json_base_dir )
data = data_keyphrase_process( json_list_of_dict, tokenizer )
```

The `data` object is composed of `500 tuples`, each one made of 4 tensors of integer token IDs (these are vocabulary indices, not embeddings yet):
- `title_tensor_` holds the token IDs of the title
- `abstract_tensor_` holds the token IDs of the abstract
- `fulltext_tensor_` holds the token IDs of the full text
- `keywords_tensor_` holds the token IDs of the keywords, one padded row per keyword
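
To show the shape of each tuple element without downloading a tokenizer, the sketch below mimics `data_keyphrase_process` with a toy whitespace vocabulary. `toy_vocab`, `toy_encode`, `toy_encode_padded` and the sample record are assumptions for illustration; the real code uses the SciBERT tokenizer and wraps the results in `torch.tensor`.

```python
# Toy vocabulary (assumption): id 0 is reserved for padding, id 1 for unknown words
toy_vocab = {'[PAD]': 0, '[UNK]': 1, 'neural': 2, 'networks': 3,
             'graph': 4, 'deep': 5, 'learning': 6}

def toy_encode(text):
    """Map each whitespace-separated word to an integer id ([UNK] if missing)."""
    return [toy_vocab.get(word, toy_vocab['[UNK]']) for word in text.lower().split()]

def toy_encode_padded(texts):
    """Encode several texts and right-pad them with [PAD] to the same length."""
    encoded = [toy_encode(text) for text in texts]
    width = max(len(ids) for ids in encoded)
    return [ids + [toy_vocab['[PAD]']] * (width - len(ids)) for ids in encoded]

# Made-up record with the same field names as the keyphrase JSON files
record = {'title': 'Deep neural networks', 'keywords': ['graph', 'deep learning']}

title_ids = toy_encode(record['title'])               # 1-D: one id per token
keyword_ids = toy_encode_padded(record['keywords'])   # 2-D: one padded row per keyword
print(title_ids)
print(keyword_ids)
```

The title becomes a flat list of ids, while the keywords become a rectangular matrix, which is why the original function passes `padding=True` only for the keywords.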

# 0.2 S2ORC Dataset
---

To build a generic loading function we take inspiration from [here](https://discuss.huggingface.co/t/pipeline-with-custom-dataset-tokenizer-when-to-save-load-manually/1084/11).

In [4]:
if s2orc_type == 'sample':
    metadata_filename = 'sample'
    pdf_parses_filename = 'sample'
    
    
elif s2orc_type == 'full':
    
    if N is None:
        print(f"[WARNING] You set 'full' but no bucket index was specified. \n \
        We'll use the index 0, so the first bucket will be used.")
        N = 0
        
    metadata_filename = f"metadata_{N}"
    pdf_parses_filename = f"pdf_parses_{N}"
    
else:
    raise NameError(f"You must select an existing S2ORC dataset \n \
                You selected {s2orc_type}, but options are ['sample' or 'full']")

meta_s2orc = data_base_dir +f'/s2orc-{s2orc_type}-20200705v1/{s2orc_type}/metadata/{metadata_filename}.jsonl'
pdfs_s2orc = data_base_dir +f'/s2orc-{s2orc_type}-20200705v1/{s2orc_type}/pdf_/{pdf_parses_filename}.jsonl'

         We'll use the index 0, so the first bucket will be used.
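
The path logic of the cell above can also be wrapped in a small function, which makes it easier to switch between the sample and a given bucket. This is only a sketch: `s2orc_paths` is a hypothetical helper, and the folder layout (including the `pdf_` folder name) is copied as-is from the cell above rather than checked against the dataset on disk.

```python
def s2orc_paths(base_dir, s2orc_type, bucket=None):
    """Hypothetical helper: return (metadata_path, pdf_parses_path) for an S2ORC split."""
    if s2orc_type == 'sample':
        meta_name = pdf_name = 'sample'
    elif s2orc_type == 'full':
        bucket = 0 if bucket is None else bucket  # default to the first bucket
        meta_name, pdf_name = f'metadata_{bucket}', f'pdf_parses_{bucket}'
    else:
        raise ValueError(f"Unknown S2ORC type {s2orc_type!r}; options are 'sample' or 'full'")

    # Same layout as the cell above
    root = f'{base_dir}/s2orc-{s2orc_type}-20200705v1/{s2orc_type}'
    return f'{root}/metadata/{meta_name}.jsonl', f'{root}/pdf_/{pdf_name}.jsonl'

# '/data' is a placeholder base directory
meta_path, pdfs_path = s2orc_paths('/data', 'full')
print(meta_path)
print(pdfs_path)
```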


(TODO merge)
```python
class S2orcDataField(Enum):
    TITLE: List[str] = ["title"]
    ABSTRACT: List[str] = ["abstract"]
    PAPER_ID: List[str] = ["paper_id"]
    YEAR: List[str] = ["year"]
    MAG_FIELD_OF_STUDY: List[str] = ["mag_field_of_study"]
    S2_URL: List[str] = ["s2_url"]
    TITLE_ABSTRACT: List[str] = ["title", "abstract"]
```

(TODO merge original function)
```python
def prepare_data(dataset_f: str,
                tokenizer: PreTrainedTokenizer,
                max_seq_length: int = None,
                batch_size: int = 64,
                num_workers: int = 0,
                seed: int = SEED,
                data_field: List[str] =  ["title", "abstract"]) -> Dict[str, DataLoader]:
    """Given an input file, prepare the train, test, validation dataloaders.
    :param dataset_f: input file (format: JSON Lines, one record per line)
    :param tokenizer: pretrained tokenizer that will prepare the data, i.e. convert tokens into IDs
    :param max_seq_length: maximal sequence length. Longer sequences will be truncated
    :param batch_size: batch size for the dataloaders
    :param num_workers: number of CPU workers to use during dataloading. On Windows this must be zero
    :return: a dictionary containing train, test, validation dataloaders
    """
    max_seq_length = tokenizer.model_max_length if not max_seq_length else max_seq_length

    def preprocess(sentences: List[str]): #-> Dict[str, Union[list, Tensor]]:
        """Preprocess the raw input sentences from the text file.
        :param sentences: a list of sentences (strings)
        :return: a dictionary of "input_ids"
        """
        tokens = [s.strip().split() for s in sentences]
        tokens = [t[:max_seq_length - 1] + [tokenizer.eos_token] for t in tokens]

        # The sequences are not padded here. we leave that to the dataloader in a collate_fn
        # ----------------------------------------------- #
        # -------- TODO include the `collate_fn` -------- #
        # ----------------------------------------------- #
        # That means: a bit slower processing, but a smaller saved dataset size
        encoded_d = tokenizer(tokens,
                             add_special_tokens=False,
                             is_split_into_words=True,  # formerly `is_pretokenized`
                             return_token_type_ids=False,
                             return_attention_mask=False)

        return {"input_ids": encoded_d["input_ids"]}

    dataset_dict = load_dataset("json", data_files=dataset_f)
    # dataset = Dataset.from_dict({"text": Path(dataset_f).read_text(encoding="utf-8").splitlines()})
    dataset = dataset_dict['train']
    # 80% (train), 20% (test + validation)
    train_testvalid = dataset.train_test_split(test_size=0.2, seed=seed)
    # 10% of total (test), 10% of total (validation)
    test_valid = train_testvalid["test"].train_test_split(test_size=0.5, seed=seed)

    dataset = DatasetDict({"train": train_testvalid["train"],
                          "test": test_valid["test"],
                          "valid": test_valid["train"]})
    print(dataset)
    """
    choose one of the dataset columns:
    - IMPORTANT fields: 
        'title', 'authors', 'abstract', 
    - LESS important fields: 
        'paper_id', 'year', 'arxiv_id', 'acl_id', 'pmc_id', 'pubmed_id', 'doi', 
        'venue', 'journal', 'mag_id', 'mag_field_of_study', 
        'outbound_citations', 'inbound_citations', 'has_outbound_citations', 'has_inbound_citations', 
        'has_pdf_body_text', 'has_pdf_parse', 'has_pdf_parsed_abstract', 'has_pdf_parsed_body_text', 'has_pdf_parsed_bib_entries', 'has_pdf_parsed_ref_entries', 
        's2_url'
    """
    dataset = dataset.map(preprocess, input_columns=data_field, batched=True)
    dataset.set_format("torch", columns=["input_ids"])

    return {partition: DataLoader(ds,
                                 batch_size=batch_size,
                                 shuffle=True,
                                 num_workers=num_workers,
                                 pin_memory=True) for partition, ds in dataset.items()}
```
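
The two-stage split is easier to check with a little arithmetic. The sketch below computes the resulting partition sizes, assuming that a fractional `test_size` is rounded up and the remainder goes to the other partition (this rounding rule is my reading of `train_test_split`, not something stated in this notebook). `split_sizes` is a hypothetical helper, and 766126 is the total of the row counts printed for `dataset_map` further down.

```python
import math

def split_sizes(n_rows, first_test_size=0.2, second_test_size=0.5):
    """Return (train, test, valid) row counts for the two-stage split."""
    held_out = math.ceil(first_test_size * n_rows)    # pool for test + validation
    n_train = n_rows - held_out
    n_test = math.ceil(second_test_size * held_out)   # half of the pool
    n_valid = held_out - n_test
    return n_train, n_test, n_valid

sizes = split_sizes(766126)
print(sizes)
```

Under that assumption the sizes come out as roughly 80% / 10% / 10% of the filtered dataset.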

(TODO merge function)
```python
# tokenizer from 'allenai/scibert_scivocab_uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# tokenizer.add_special_tokens({"eos_token": "[EOS]"})
DATA_FIELD =  ["abstract"]

# pass data_field by keyword: the third positional argument is max_seq_length
prepare_data(meta_s2orc, tokenizer, data_field=DATA_FIELD)
```

---
---
## ❌ PARTIAL PREPARE
---
---

In [5]:
# Detecting last checkpoint.
last_checkpoint = None

output_dir=f'./tmp_trainer/#{RUN_NUMBER}_{RUN_ITER}_{RUN_NAME}'
do_train = True
overwrite_output_dir = False

if os.path.isdir(output_dir) and do_train and not overwrite_output_dir:
    last_checkpoint = get_last_checkpoint(output_dir)
    if last_checkpoint is None and len(os.listdir(output_dir)) > 0:
        raise ValueError(
            f"Output directory ({output_dir}) already exists and is not empty. "
            "Use --overwrite_output_dir to overcome."
        )
    elif last_checkpoint is not None:
        # logging.info
        print(
            f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
            "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
        )

if last_checkpoint is not None:
    checkpoint = last_checkpoint
elif model_name_or_path is not None and os.path.isdir(model_name_or_path):
    checkpoint = model_name_or_path
else:
    checkpoint = None

print(checkpoint)
print(output_dir)

NameError: name 'RUN_NUMBER' is not defined
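
For readers unfamiliar with checkpoint detection, the following is a rough sketch of the idea behind `get_last_checkpoint`, assuming the usual `checkpoint-<step>` folder naming written by the `Trainer`. `find_last_checkpoint` is a hypothetical stand-in written for this notebook, not the library's implementation.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r'^checkpoint-(\d+)$')

def find_last_checkpoint(folder):
    """Return the `checkpoint-<step>` subfolder with the highest step, or None."""
    steps = {}
    for name in os.listdir(folder):
        match = CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(folder, name)):
            steps[int(match.group(1))] = name
    if not steps:
        return None
    return os.path.join(folder, steps[max(steps)])

# Build a throwaway run directory to try the helper on
with tempfile.TemporaryDirectory() as run_dir:
    for name in ('checkpoint-500', 'checkpoint-1500', 'logs'):
        os.makedirs(os.path.join(run_dir, name))
    last = os.path.basename(find_last_checkpoint(run_dir))
    print(last)
```

Steps are compared as integers, so `checkpoint-1500` wins over `checkpoint-500` even though it sorts first alphabetically.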

In [4]:
# tokenizer from 'allenai/scibert_scivocab_uncased' or from checkpoint
if checkpoint is not None:
    PRETRAINED = checkpoint 
    # logging.info
    print(f"Checkpoint detected, model load from {PRETRAINED}.")
else:
    PRETRAINED = MODEL_PATH
    # logging.info
    print(f"Checkpoint DOESN'T exist, model load from scratch at {PRETRAINED}.")
    
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED) 
model = BertForMaskedLM.from_pretrained(PRETRAINED) 

NameError: name 'checkpoint' is not defined

Model BertConfig object for 

```python
Model config BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.3.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 31090
}
```

The maximum sequence length of the tokenizer, of the model and of the input may differ:

In [7]:
max_seq_length = model.config.max_position_embeddings
print(max_seq_length)

512
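
Since the three limits may disagree, one option is to truncate to the smallest of them. The helper below is a sketch under that assumption; `effective_max_length` is a hypothetical name, and the values passed in stand for `tokenizer.model_max_length`, `model.config.max_position_embeddings` and an optional user-requested length.

```python
def effective_max_length(tokenizer_max, model_max, requested=None):
    """Pick a truncation length that every component can handle."""
    limit = min(tokenizer_max, model_max)
    return limit if requested is None else min(requested, limit)

# tokenizer.model_max_length can be a huge sentinel value when the checkpoint does not set it
print(effective_max_length(int(1e30), 512))
print(effective_max_length(512, 512, requested=128))
```

Taking the minimum guards against a tokenizer that reports no real limit while the model only has 512 position embeddings.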


In [None]:
def partial_prepare_data(dataset_f: str,
                tokenizer: PreTrainedTokenizer,
                max_seq_length: int = None,
                batch_size: int = 64,
                num_workers: int = 4,
                seed: int = SEED,
                data_field: List[str] =  ["title", "abstract"]) -> DatasetDict:
    """Given an input file, prepare the train, test, validation dataloaders.
    :param dataset_f: input file (format: JSON Lines, one record per line)
    :param tokenizer: pretrained tokenizer that will prepare the data, i.e. convert tokens into IDs
    :param max_seq_length: maximal sequence length. Longer sequences will be truncated
    :param batch_size: batch size for the dataloaders
    :param num_workers: number of CPU workers to use during dataloading. On Windows this must be zero
    :return: a DatasetDict containing the train, test and validation splits
    """
    print_all_debug = False
    time_debug = True
    print_some_debug = True

    ## ------------------ ##
    ## -- LOAD DATASET -- ##
    ## ------------------ ##
    if time_debug: start = time.time()
    if time_debug: start_load = time.time()
        
    ## execution
    max_seq_length = tokenizer.model_max_length if not max_seq_length else max_seq_length
    if print_some_debug: print(max_seq_length)
    dataset_dict = load_dataset("json", data_files=dataset_f)

    if time_debug: end_load = time.time()
    if time_debug: print(f"[TIME] load_dataset: {end_load - start_load}")
    
    ## ------------------ ##
    ## ---- MANAGING ---- ##
    ## ------------------ ##
    if time_debug: start_selection = time.time()
    
    ## execution
    dataset = dataset_dict['train']
    
    if time_debug: end_selection = time.time()
    if time_debug: print(f"[TIME] dataset_train selection: {end_selection - start_selection}")
    if print_all_debug: print(dataset)
   
    ## ------------------ ##
    ## --- REMOVE none -- ##
    ## ------------------ ##
    if time_debug: start_removing = time.time()
    # clean input removing papers with **None** as abstract/title
    if remove_None_papers:

        ## --------------------- ##
        ## --- REMOVE.indexes -- ##
        ## --------------------- ##
        if time_debug: start_removing_indexes = time.time()
        if print_all_debug: print(data_field)
        
        ## execution
        none_papers_indexes = {}
        for field in data_field:
            none_indexes = [ idx_s for idx_s, s in enumerate(dataset[f"{field}"]) if s is None]
            none_papers_indexes = {**none_papers_indexes, **dict.fromkeys(none_indexes , False)}

        if time_debug: end_removing_indexes = time.time()
        if time_debug: print(f"[TIME] remove.indexes: {end_removing_indexes - start_removing_indexes}")
        if print_all_debug: print(none_papers_indexes)
        
        ## --------------------- ##
        ## --- REMOVE.concat --- ##
        ## --------------------- ##
        if time_debug: start_removing_concat = time.time()
        
        ## execution
        to_remove_indexes = list(none_papers_indexes.keys())

        if time_debug: end_removing_concat = time.time()
        if time_debug: print(f"[TIME] remove.concat: {end_removing_concat - start_removing_concat}")
        if print_all_debug: print(to_remove_indexes)
        if print_all_debug: print([ dataset["abstract"][i] for i in to_remove_indexes])

        ## --------------------- ##
        ## --- REMOVE.filter --- ##
        ## --------------------- ##
        if time_debug: start_removing_filter = time.time()
        
        ## execution
        dataset = dataset.filter((lambda x, ids: none_papers_indexes.get(ids, True)), with_indices=True)
        
        if time_debug: end_removing_filter = time.time()
        if time_debug: print(f"[TIME] remove.filter: {end_removing_filter - start_removing_filter}")
        if print_all_debug: print(dataset)

        
    if time_debug: end_removing = time.time()
    if time_debug: print(f"[TIME] remove None fields: {end_removing - start_removing}")

    ## --------------------- ##
    ## --- REMOVE.column --- ##
    ## --------------------- ##
    if time_debug: start_remove_unused_columns = time.time()
    if remove_Unused_columns:
        
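        unused_columns = [column for column in dataset.column_names if column not in data_field]
        if debug: print(unused_columns)
        # remove_columns returns a new dataset (the in-place remove_columns_ is deprecated)
        dataset = dataset.remove_columns(unused_columns)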

    if time_debug: end_remove_unused_columns = time.time()
    if time_debug: print(f"[TIME] remove.column: {end_remove_unused_columns - start_remove_unused_columns}")
        
    ## ------------------ ##
    ## --- SPLIT 1.    -- ##
    ## ------------------ ##
    if time_debug: start_first_split = time.time()
    
    # 80% (train), 20% (test + validation)
    ## execution
    train_testvalid = dataset.train_test_split(test_size=0.2, seed=seed)
    
    if time_debug: end_first_split = time.time()
    if time_debug: print(f"[TIME] first [train-(test-val)] split: {end_first_split - start_first_split}")

    ## ------------------ ##
    ## --- SPLIT 2.    -- ##
    ## ------------------ ##
    if time_debug: start_second_split = time.time()
    
    # 10% of total (test), 10% of total (validation)
    ## execution
    test_valid = train_testvalid["test"].train_test_split(test_size=0.5, seed=seed)

    if time_debug: end_second_split = time.time()
    if time_debug: print(f"[TIME] second [test-val] split: {end_second_split - start_second_split}")

    ## execution
    dataset = DatasetDict({"train": train_testvalid["train"],
                          "test": test_valid["test"],
                          "valid": test_valid["train"]})
    if time_debug: end = time.time()
    if time_debug: print(f"[TIME] TOTAL: {end - start}") 
    return dataset
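
The None-removal step above collects indexes first and filters afterwards. The same effect can be expressed as a single predicate over each example; the sketch below shows it on plain Python rows so it runs without the real data. `has_all_fields` and the sample rows are made up for illustration.

```python
def has_all_fields(example, fields):
    """True when none of the requested fields is None."""
    return all(example[field] is not None for field in fields)

# Made-up rows standing in for S2ORC metadata records
rows = [
    {'title': 'A', 'abstract': 'text', 'year': 2019},
    {'title': None, 'abstract': 'text', 'year': 2020},
    {'title': 'C', 'abstract': None, 'year': None},
]

kept = [row for row in rows if has_all_fields(row, ['title', 'abstract'])]
print(kept)
```

The same predicate could presumably be passed to the dataset directly, e.g. `dataset.filter(lambda ex: has_all_fields(ex, data_field))`, though this has not been timed against the index-based version on the full bucket.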

In [None]:
%%time

# tokenizer.add_special_tokens({"eos_token": "[EOS]"})
DATA_FIELD =  ["title", "abstract"]

# here we use meta_s2orc for speed
dataset = partial_prepare_data(meta_s2orc, tokenizer, data_field=DATA_FIELD, max_seq_length=max_seq_length)

In [None]:
print(dataset)

In [None]:
def preprocess(*sentences_by_column, data, target): #-> Dict[str, Union[list, Tensor]]:
    """Preprocess the raw input sentences from the text file.
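    :param sentences_by_column: one list of sentences (strings) per input column, data column first, target column second
    :param data: names of the data columns (currently must be ['abstract'])
    :param target: names of the target columns (currently must be ['title'])
    :return: a dictionary with "data_input_ids" and "target_input_ids"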
    """
    print_all_debug = False
    time_debug = False
    print_some_debug = False

    if debug: print(f"[INFO-START] Preprocess on data: {data}, target: {target}") 
    
    assert data == ['abstract'], "data should be ['abstract']"
    if debug: print(data)
    assert target == ['title'], "target should be ['title']"
    if debug: print(target)
        
    data_columns_len = len(data)
    target_columns_len = len(target)
    columns_len = data_columns_len + target_columns_len
    
    assert data_columns_len == 1, "data length should be 1"
    if debug: print(data_columns_len)
    assert target_columns_len == 1, "target length should be 1"
    if debug: print(target_columns_len)
        
    sentences_by_column = np.asarray(sentences_by_column)
    input_columns_len = len(sentences_by_column)
    
    if debug: print(f'all sentences (len {input_columns_len}): {sentences_by_column}')
    
    if target_columns_len == 0:
        raise NameError("No target variable selected, \
                    are you sure you don't want any target?")
        
    data_sentences = sentences_by_column[0]
    target_sentences = sentences_by_column[1] # if columns_len == input_columns_len else sentences_by_column[data_columns_len:-1]
    
    if debug: print(data_sentences)
    if debug: print(target_sentences)

    """
    # clean input removing **None**, converting them to **''**
    if clean_None_data:
        data_sentences = np.asarray([ s if s is not None else '' for s in data_sentences])
        target_sentences = np.asarray([ s if s is not None else '' for s in target_sentences])

    # clean input removing papers with **None** as abstract/title
    elif remove_None_data:
        none_data_indexes = np.asarray([ idx_s for idx_s, s in enumerate(data_sentences) if s is None])
        none_target_indexes = np.asarray([ idx_s for idx_s, s in enumerate(target_sentences) if s is None])

        if debug: print(none_data_indexes)
        if debug: print(none_target_indexes)

        to_removed_indexes = np.union1d(none_data_indexes, none_target_indexes)

        if debug: print(to_removed_indexes)

        data_sentences = np.delete(data_sentences, to_removed_indexes)
        target_sentences = np.delete(target_sentences, to_removed_indexes)
    
    if debug: print(data_sentences)
    if debug: print(target_sentences)
    """
    
    # sentences = [s for s in sentences if s is not None]
    # tokens = [s.strip().split() for s in sentences]
    # tokens = [t[:max_seq_length - 1] + [tokenizer.eos_token] for t in tokens]

    # The sequences are not padded here. we leave that to the dataloader in a collate_fn
    # ----------------------------------------------- #
    # -------- TODO include the `collate_fn` -------- #
    # ----------------------------------------------- #
    # That means: a bit slower processing, but a smaller saved dataset size
    if print_some_debug: print(max_seq_length)
        
    data_encoded_d = tokenizer(
                        text=data_sentences.tolist(),
                        # add_special_tokens=False,
                        # is_pretokenized=True,
                        padding=True, truncation=True, max_length=max_seq_length,
                        return_token_type_ids=False,
                        return_attention_mask=False,
                        # We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it
                        # receives the `special_tokens_mask`.
                        return_special_tokens_mask=True,
                        return_tensors='np'
    )
    
    target_encoded_d = tokenizer(
                        text=target_sentences.tolist(),
                        # add_special_tokens=False,
                        # is_pretokenized=True,
                        padding=True, truncation=True, max_length=max_seq_length,
                        return_token_type_ids=False,
                        return_attention_mask=False,
                        # We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it
                        # receives the `special_tokens_mask`.
                        return_special_tokens_mask=True,
                        return_tensors='np'
    )

                            

    if debug: print(data_encoded_d["input_ids"].shape)
    if debug: print(target_encoded_d["input_ids"].shape)
    # return encoded_d
    
    return {"data_input_ids": data_encoded_d["input_ids"], "target_input_ids": target_encoded_d["input_ids"]}
    # return {"input_ids": sum(encoded_d['input_ids'], [])} 

Print an example:
```python 
print(dataset['train'][:10]['title'], dataset['train'][:10]['abstract'])
```

In [None]:
vocab = tokenizer.get_vocab()
print(f"[PAD]: {vocab['[PAD]']}")
print(f"[UNK]: {vocab['[UNK]']}")
print(f"[SEP]: {vocab['[SEP]']}")
print(f"[CLS]: {vocab['[CLS]']}")
print(f"0: {tokenizer.convert_ids_to_tokens(0)}")
print(f"1: {tokenizer.convert_ids_to_tokens(1)}")
print(f"2: {tokenizer.convert_ids_to_tokens(2)}")
print(f"99: {tokenizer.convert_ids_to_tokens(99)}")
print(f"100: {tokenizer.convert_ids_to_tokens(100)}")
print(f"101: {tokenizer.convert_ids_to_tokens(101)}")

In [None]:
tokenizer

Finally, I found [this](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=datasetdict#datasets.DatasetDict.map) documentation for the function `DatasetDict.map` from the `datasets` library.

In [None]:
# only keys that `preprocess` accepts as keyword arguments, and only columns still present in `dataset`
dictionary_input = { "data": ["abstract"], "target": ["title"] }
dictionary_columns = sum(dictionary_input.values(), [])
dataset_map = dataset.map(preprocess, input_columns= dictionary_columns, fn_kwargs= dictionary_input, batched=True)
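
It may help to see how the arguments reach `preprocess` in a batched `map` call: the columns listed in `input_columns` arrive as positional arguments (one list per column) and `fn_kwargs` arrive as keyword arguments. The sketch below imitates a single call in plain Python; `call_like_map` and `show_args` are hypothetical stand-ins, not the library's code.

```python
def call_like_map(function, batch, input_columns, fn_kwargs):
    """Toy stand-in for one batched `map` call (not the library's code)."""
    positional = [batch[column] for column in input_columns]
    return function(*positional, **fn_kwargs)

def show_args(*sentences_by_column, data, target):
    """Report what a `preprocess`-style function receives."""
    return {'n_columns': len(sentences_by_column),
            'first_column': sentences_by_column[0],
            'data': data,
            'target': target}

# Made-up batch with two rows
batch = {'abstract': ['abs 1', 'abs 2'], 'title': ['title 1', 'title 2']}
fn_kwargs = {'data': ['abstract'], 'target': ['title']}
input_columns = sum(fn_kwargs.values(), [])  # flattens to ['abstract', 'title']

received = call_like_map(show_args, batch, input_columns, fn_kwargs)
print(received)
```

Because the keyword names must match the function signature, every key in the dictionary passed as `fn_kwargs` needs a corresponding parameter in `preprocess`.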

In [None]:
dataset_map

In [None]:
dataset_map = dataset_map.rename_column("data_input_ids", "input_ids")

In [None]:
dataset_map.set_format("torch", columns=["input_ids"])

In [None]:
print(dataset_map['train'][1]['input_ids'].size())

In [None]:
%store dataset_map

```python
def pad_collate(batch):
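    """Pad the data and target sequences of a batch to a common length.

    Each element of `batch` is a dict holding exactly two sequences of
    token IDs: the data sequence first, the target sequence second.
    """
    xx, yy = [], []
    for elem in batch:
        x, y = elem.values()
        # pad_sequence needs tensors, so convert lists/arrays when necessary
        xx.append(torch.as_tensor(x, dtype=torch.long))
        yy.append(torch.as_tensor(y, dtype=torch.long))

    if debug: print(xx, yy)

    # original (unpadded) lengths
    x_lens = [len(x) for x in xx]
    y_lens = [len(y) for y in yy]
    if debug: print(x_lens, y_lens)

    # 0 is the pad_token_id of the model
    xx_pad = pad_sequence(xx, batch_first=True, padding_value=0)
    yy_pad = pad_sequence(yy, batch_first=True, padding_value=0)

    return xx_pad, yy_pad, x_lens, y_lens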
```

```python
dataset_map['test']['target_input_ids']
```

```python
LIST_DICT = [
    {'data_input_ids': [  
          106,  3531,  3253,   165,  1151,   214,  9178,   140,   111,  8830, 
        16091, 30113,   190,   106,  3892, 13527, 1281,  1814,   256,   165,
         3568,   191,   130,  3081,  2936,  5796,   190,   111,  2279,   131,
          130,  4548,  5859,   205,   147,   535,  1313,   131,   111,  3551,
         1171, 12062,   988,   131,   111, 23137,   137,  1346, 11250,   131,
         8830, 16091, 30113,   205,   111,  8064,   137, 19876,  1382,   146,
          145,   124,   422,  2194,   546,   165,  2030,   263,   111,  2027,
          131,   111,  3081,  2936,  3551,   205,   185, 27618,   405,   111,
          705,  2484,  1151,   263,  5197,  7423,   131,  5157,   205], 
     'target_input_ids': [  
          130,  3081, 13204,   168,  5373,   655,  8064,   137,  1333,  4620,
          131,  8830,  5035, 28067, 30118,  9365]
    },
    {
        'data_input_ids': [], 
        'target_input_ids': [ 
         7831,   131, 11536,  1630,  4416,  2474,  4127,   235,  4353,  1352,
         2329,   579,  3274,   579, 20356,   579,  3967,   579, 15969,  3396]
    }
]
```

```python
pad_collate(np.asarray(LIST_DICT))
```
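
What the padding step produces is easy to see on plain lists. The sketch below is a pure-Python illustration of right-padding with `batch_first=True` semantics; `pad_batch` is a hypothetical helper, and the ids are a few values borrowed from `LIST_DICT` above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest one."""
    width = max((len(seq) for seq in sequences), default=0)
    padded = [list(seq) + [pad_id] * (width - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]  # original lengths, before padding
    return padded, lengths

padded, lengths = pad_batch([[106, 3531, 3253], [], [7831, 131]])
print(padded)
print(lengths)
```

Returning the original lengths alongside the padded batch lets later code tell real tokens from padding, which matters for the empty sequence in the second row.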

Create a DataLoader (not needed for the steps that follow).
```python
dataset_result = {partition: DataLoader(ds,
                                 batch_size=64,
                                 shuffle=True,
                                 num_workers=4,
                                 collate_fn=pad_collate,
                                 pin_memory=True) for partition, ds in dataset_map.items()}
```

Testing the dataloader.
```python
for data_, target_, data_len, target_len in dataset_result['train']:  
    print(f"data_len      : {data_len}\n")
    print(f"data_         : {data_}\n")
    print(f"data_.shape   : {data_.shape}\n\n")
    
    print(f"target_len    : {target_len}\n")
    print(f"target_       : {target_}\n")
    print(f"target_.shape : {target_.shape}\n\n")
```

---
---
## ❌ FAKE PIPELINE for train BERT-based NETS
---
---

Testing the dataset_map elements.
```python
for data_ in dataset_map['train']:  
    print(f"data_        : {data_['input_ids'].size()}\n")
```

In [1]:
%store -r dataset_map

In [2]:
dataset_map

DatasetDict({
    train: Dataset({
        features: ['abstract', 'input_ids', 'target_input_ids', 'title'],
        num_rows: 612900
    })
    test: Dataset({
        features: ['abstract', 'input_ids', 'target_input_ids', 'title'],
        num_rows: 76613
    })
    valid: Dataset({
        features: ['abstract', 'input_ids', 'target_input_ids', 'title'],
        num_rows: 76613
    })
})

In [14]:
dataset_map.set_format("torch", columns=["input_ids", "target_input_ids"])

In [16]:
#tokenizer = AutoTokenizer.from_pretrained() 
#model = BertForMaskedLM.from_pretrained() 
dataset_map['train'][0]['target_input_ids']

tensor([  102,  6935,   669,   131,  3101,  1262,   121,  3838,   131,  1505,
          573, 11996, 30110,   579,  3641,  1146,  2385, 12006,  4566,   121,
        14817,   103,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0])

The code below is borrowed from [this example script](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py).

In [15]:
# Data collator
# This one will take care of randomly masking the tokens.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

train_dataset = dataset_map['train']
eval_dataset = dataset_map['valid']

# Initialize TrainingArguments
training_args = TrainingArguments(
    output_dir=output_dir,           # [def.`tmp_trainer`] output directory
    num_train_epochs=3,              # [def.   3 ] total # of training epochs
    per_device_train_batch_size=64,  # [def.   8 ] batch size per device during training
    per_device_eval_batch_size=64,   # [def.   8 ] batch size for evaluation
    evaluation_strategy="steps",     # [def. 'no'] evaluation is done (and logged) every eval_steps
    warmup_steps=500,                # [def.   0 ] number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # [def.   0 ] strength of weight decay 
    learning_rate=5e-5,              # [def. 5e-5] 
    logging_dir='./logs',            # [def. runs/__id__] directory for storing logs. TensorBoard log directory.
)

# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer

PyTorch: setting up devices
No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: abstract, target_input_ids, title.
The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: abstract, target_input_ids, title.


<transformers.trainer.Trainer at 0x7f137d23e790>
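
To make the role of the data collator more concrete, here is a minimal pure-Python sketch of the masking scheme that `DataCollatorForLanguageModeling` applies when `mlm=True`: each non-special token is selected with probability `mlm_probability`; a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function name `mask_tokens` and the special-token ids are assumptions made for illustration (in practice they come from the tokenizer), and this is not the library's implementation.

```python
import random

# Assumed special-token ids, for illustration only; read them from the tokenizer in practice.
PAD_ID, CLS_ID, SEP_ID, MASK_ID = 0, 102, 103, 104
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def mask_tokens(input_ids, vocab_size, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) following the 80/10/10 masking scheme."""
    rng = rng if rng is not None else random.Random()
    special = {PAD_ID, CLS_ID, SEP_ID}
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        if token in special or rng.random() >= mlm_probability:
            continue  # token not selected: no loss is computed at this position
        labels[i] = token  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels
```

Because the masking is random and applied per batch, every epoch sees a different masked version of the same sequence, which is why the collator (and not the dataset preprocessing) takes care of it.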

In [None]:
# Training
train_result = trainer.train(checkpoint)
trainer.save_model()  # Saves the tokenizer too for easy upload
metrics = train_result.metrics

max_train_samples = len(train_dataset)
metrics["train_samples"] = min(max_train_samples, len(train_dataset))

trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()



In [None]:
# Evaluation

#logger.info
print("*** Evaluate ***")

metrics = trainer.evaluate()

max_val_samples = len(eval_dataset)
metrics["eval_samples"] = min(max_val_samples, len(eval_dataset))
perplexity = math.exp(metrics["eval_loss"])
metrics["perplexity"] = perplexity

trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
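
The perplexity reported above is simply the exponential of the evaluation loss (the mean cross-entropy over the masked tokens, in nats). As a small sketch, the helper below (the name `perplexity_from_loss` is ours) also guards against overflow when the loss is very large, which can happen early in training.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is exp(mean cross-entropy); return inf if the exponent overflows."""
    try:
        return math.exp(mean_cross_entropy)
    except OverflowError:
        return float("inf")
```

A loss of `0` corresponds to a perplexity of `1` (the model is always certain and correct), while a loss of `log(k)` corresponds to being as uncertain as a uniform choice among `k` tokens.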

In [None]:
!mv 

---
## 1. Introduction
---

The following datasets were downloaded from the internet (we try to provide links wherever we have the right to do so). We group the datasets by the task they are most commonly used for.

### 1.1 Keyphrase task
---

SOTA: [keyphrase generation](https://arxiv.org/pdf/1704.06879.pdf).

The keyphrase datasets (***duc***, ***Inspec***, ***Krapivin***, ***NUS***, ***SemEval-2010***, ***KP20k dataset***, ***MagKP-CS***) are structured as follows:

- title
- abstract
- fulltext
- keywords

The only dataset that differs is ***STACKEX***, which instead of *abstract* and *keywords* has:

- question (abstract)
- tags (keywords)

Here is a list of the datasets mentioned above, with some information about each:

- **duc**: we haven't found much information on this dataset so far.

- **Inspec** [(Hulth, 2003)](https://www.aclweb.org/anthology/W03-1028.pdf): This dataset provides *2,000 paper abstracts*. We adopt the *500 testing* papers and their corresponding uncontrolled keyphrases for evaluation, and the remaining *1,500 papers* are used for *training* the supervised baseline models.

- **Krapivin** [(Krapivin et al., 2008)](http://eprints.biblio.unitn.it/1671/1/disi09055-krapivin-autayeu-marchese.pdf): This dataset provides *2,304 papers with full-text* and *author-assigned keyphrases*. However, the authors did not mention how to split the testing data, so we selected the first *400 papers in alphabetical order as the testing data*, and the *remaining* papers are used to *train* the supervised baselines.

- **NUS** [(Nguyen and Kan, 2007)](https://www.comp.nus.edu.sg/~kanmy/papers/icadl2007.pdf): We use both author-assigned and reader-assigned keyphrases and treat *all 211 papers as the testing data*. Since the NUS dataset did not specifically mention the ways of splitting training and testing data, the results of the supervised baseline models are obtained through a *five-fold cross-validation*.

- **SemEval-2010** [(Kim et al., 2010)](https://www.aclweb.org/anthology/S10-1004.pdf): 288 articles were collected from the ACM Digital Library. 100 articles were used for testing and the rest were used for training supervised baselines.

- **KP20k dataset** [(Meng et al., 2018)](https://arxiv.org/abs/1704.06879): They built a new testing dataset that contains the *titles, abstracts, and keyphrases* of *20,000 scientific articles* in computer science. They were *randomly selected from their obtained 567,830 articles*. Thus they took the 20,000 articles in the validation set to train the supervised baselines.

- **MagKP-CS** (from OpenNMT-py and [OpenNMT-kpg-release](https://github.com/memray/OpenNMT-kpg-release)) that is available for download. 

- **STACKEX** (from [StackExchange](https://archive.org/details/stackexchange)) has been constructed from the computer science forums (CS/AI) at StackExchange, using “title” + “body” as source text and “tags” as the target keyphrases. After removing questions without valid tags, they collected 330,965 questions. They randomly selected *16,000 for validation* and another *16,000 as test set*. Note that some questions in StackExchange forums contain large blocks of code, resulting in long texts (sometimes more than 10,000 tokens after tokenization), which are difficult for most neural models to handle. Consequently, the texts have been truncated to 300 tokens for the training split and to 1,000 tokens for the evaluation splits.

###### ⚠️ATTENTION
> As we aren't going to use the keyphrase datasets for now, we don't need any custom classes to manage them. We will implement these functions and classes as we go, if the need arises.

### 1.2 Sentence embedding task
---

SOTA: [sBERT](https://arxiv.org/abs/1908.10084)

- **SNLI** [(Bowman et al., 2015)](https://arxiv.org/abs/1508.05326) is a collection of *570,000 sentence pairs* annotated with the *labels contradiction, entailment, and neutral*.

- **MultiNLI** [(Williams et al., 2018)](https://arxiv.org/abs/1704.05426) contains *430,000 sentence pairs* and covers a *range of genres of spoken and written text*.

- **SciTail** [(allenai)](http://ai2-website.s3.amazonaws.com/publications/scitail-aaai-2018_cameraready.pdf): this entailment dataset consists of 27k examples. In contrast to SNLI and MultiNLI, it was not crowd-sourced but created from sentences that already exist “in the wild”. *Hypotheses* were created from *science questions* and the corresponding *answer candidates*, while relevant web sentences from a large corpus were used as premises. Models are evaluated based on accuracy.

###### ❌ATTENTION
> As we aren't going to use the NLI datasets (for now), we don't need any custom classes to manage them. We will implement these functions and classes as we go, if the need arises.

### 1.3 Generic NLP tasks
---

- **S2ORC** [(Lo et al., 2020)](https://github.com/allenai/s2orc) is a large corpus of *81.1M English-language academic papers* spanning many academic disciplines. The corpus consists of *rich metadata, paper abstracts, resolved bibliographic references*, as well as *structured full text for 8.1M open access papers*. Full text is annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. In S2ORC, they aggregate papers from hundreds of academic publishers and digital archives into a unified source, and create the largest publicly-available collection of machine-readable academic text to date. Built for text mining over academic text.

- **OAG** [(Tang et al., 2008)](http://keg.cs.tsinghua.edu.cn/jietang/publications/KDD08-Tang-et-al-ArnetMiner.pdf)  is a large knowledge graph unifying *two billion-scale academic graphs*: Microsoft Academic Graph (**MAG**) and **AMiner**. In mid 2017, they published OAG v1, which contains *166,192,182 papers from MAG and 154,771,162 papers from AMiner* and generated *64,639,608 linking (matching) relations between the two graphs*. This time, in OAG v2, author, venue and newer publication data and the corresponding matchings are available.



###### ✅ATTENTION
> We are going to use the S2ORC dataset, as it contains full-text data as well as citation/reference information. It also contains authorship, title, and table data, which we describe below.

---
# 2. S2ORC
---

## 2.1 Description (s2orc)
---
The `S2ORC` dataset is in the `data` path under the folder `s2orc-full-20200705v1` (where `s2orc` is the name of the dataset, `full` is the type, as there is also a `sample` variant, and `20200705v1` is the version). 
We can reach it by leaving the project folder and entering the data folder:

In [None]:
DATA_PATH = '/home/vivoli/Thesis/data' 
!ls $DATA_PATH

In [None]:
custom_path = f"{DATA_PATH}/s2orc-full-20200705v1/full"
!ls $custom_path

As you can see (going into `s2orc-full-20200705v1/full/`), there are a `metadata` folder and a `pdf_parses` folder. The main difference (as the names already suggest) is that `metadata` only holds some information about each paper (retrieved from the published metadata), while `pdf_parses` holds the extensive data contained in the paper itself (if the paper was available, was correctly parsed, and no restriction applied to its data due to limited license permissions). The `title` of the paper is contained only in the `metadata` file, but it can be retrieved through the `paper_id` field of the paper itself.

More information about the `S2ORC` dataset can be read in the [README.md](https://github.com/allenai/s2orc/blob/master/README.md) of the project and in the [project repository](https://github.com/allenai/s2orc/)

### mag field
- MAG fields of study:

| class | Field of study | All papers | Full text |
|-------|----------------|------------|-----------|
|0      | Medicine       | 12.8M      | 1.8M      |
|1      | Biology        | 9.6M       | 1.6M      |
|2      | Chemistry      | 8.7M       | 484k      |
|3      | n/a            | 7.7M       | 583k      |
|4      | Engineering    | 6.3M       | 228k      |
|5      | Comp Sci       | 6.0M       | 580k      |
|6      | Physics        | 4.9M       | 838k      |
|7      | Mat Sci        | 4.6M       | 213k      |
|8      | Math           | 3.9M       | 669k      |
|9      | Psychology     | 3.4M       | 316k      |
|10     | Economics      | 2.3M       | 198k      |
|11     | Poli Sci       | 1.8M       | 69k       |
|12     | Business       | 1.8M       | 94k       |
|13     | Geology        | 1.8M       | 115k      |
|14     | Sociology      | 1.6M       | 93k       |
|15     | Geography      | 1.4M       | 58k       |
|16     | Env Sci        | 766k       | 52k       |
|17     | Art            | 700k       | 16k       |
|18     | History        | 690k       | 22k       |
|19     | Philosophy     | 384k       | 15k       |

We now need a function that reads all the lines of the `jsonl` files inside both the `metadata` and `pdf_parses` folders. Then we'll 

## `metadata` schema

We recommend everyone work with `metadata/` as the starting point.  This is a JSONlines file (one line per paper) with the following keys:

#### Identifier fields

* `paper_id`: a `str`-valued field that is a unique identifier for each S2ORC paper.

* `arxiv_id`: a `str`-valued field for papers on [arXiv.org](https://arxiv.org).

* `acl_id`: a `str`-valued field for papers on [the ACL Anthology](https://www.aclweb.org/anthology/).

* `pmc_id`: a `str`-valued field for papers on [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/articles).

* `pubmed_id`: a `str`-valued field for papers on [PubMed](https://pubmed.ncbi.nlm.nih.gov/), which includes MEDLINE.  Also known as `pmid` on PubMed.

* `mag_id`: a `str`-valued field for papers on [Microsoft Academic](https://academic.microsoft.com).

* `doi`: a `str`-valued field for the [DOI](http://doi.org/).  

Notably:

* Resolved citation links are represented by the cited paper's `paper_id`.

* The `paper_id` resolves to a Semantic Scholar paper page, which can be verified using the `s2_url` field.

* We don't always have a value for every identifier field.  When missing, they take `null` value.


#### Metadata fields

* `title`: a `str`-valued field for the paper title.  Every S2ORC paper *must* have one, though the source can be from publishers or parsed from PDFs.  We prioritize publisher-provided values over parsed values.

* `authors`: a `List[Dict]`-valued field for the paper authors.  Authors are listed in order.  Each dictionary has the keys `first`, `middle`, `last`, and `suffix` for the author name, which are all `str`-valued with exception of `middle`, which is a `List[str]`-valued field.  Every S2ORC paper *must* have at least one author.

* `venue` and `journal`: `str`-valued fields for the published venue/journal.  *Please note that there is not often agreement as to what constitutes a "venue" versus a "journal". Consolidating these fields is being considered for future releases.*   

* `year`: an `int`-valued field for the published year.  If a paper is preprinted in 2019 but published in 2020, we try to ensure the `venue/journal` and `year` fields agree & prefer non-preprint published info. *We know this decision prohibits certain types of analysis like comparing preprint & published versions of a paper.  We're looking into it for future releases.*  

* `abstract`: a `str`-valued field for the abstract.  These are provided directly from gold sources (not parsed from PDFs).  We preserve newline breaks in structured abstracts, which are common in medical papers, by denoting breaks with `':::'`.     

* `inbound_citations`: a `List[str]`-valued field containing `paper_id` of other S2ORC papers that cite the current paper.  *Currently derived from PDF-parsed bibliographies, but may have gold sources in the future.*

* `outbound_citations`: a `List[str]`-valued field containing `paper_id` of other S2ORC papers that the current paper cites.  Same note as above.   

* `has_inbound_citations`: a `bool`-valued field that is `true` if `inbound_citations` has at least one entry, and `false` otherwise.

* `has_outbound_citations` a `bool`-valued field that is `true` if `outbound_citations` has at least one entry, and `false` otherwise.

We don't always have a value for every metadata field.  When missing, `str` fields take `null` value, while `List` fields are empty lists.
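
As a small illustration of the field conventions above, the sketch below turns an author dictionary into a display name and splits a structured abstract on the `':::'` marker. The helper names `full_name` and `abstract_sections` are ours and are not part of S2ORC; missing values are handled defensively because any field may be `null`.

```python
def full_name(author):
    """Join first, middle (a list), last and suffix, skipping empty or missing parts."""
    parts = [author.get("first"), *(author.get("middle") or []),
             author.get("last"), author.get("suffix")]
    return " ".join(part for part in parts if part)


def abstract_sections(abstract):
    """Split a structured abstract on ':::'; a missing abstract gives an empty list."""
    if not abstract:
        return []
    return [part.strip() for part in abstract.split(":::") if part.strip()]
```

For an unstructured abstract, `abstract_sections` simply returns a list with a single element.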

#### PDF parse-related metadata fields

* `has_pdf_parse`:  a `bool`-valued field that is `true` if this paper has a corresponding entry in `pdf_parses/`, which means we had processed that paper's PDF(s) at some point.  The field is `false` otherwise.

* `has_pdf_parsed_abstract`: a `bool`-valued field that is `true` if the paper's PDF parse contains a parsed abstract, and `false` otherwise.   

* `has_pdf_parsed_body_text`: a `bool`-valued field that is `true` if the paper's PDF parse contains parsed body text, and `false` otherwise.

* `has_pdf_parsed_bib_entries`: a `bool`-valued field that is `true` if the paper's PDF parse contains parsed bibliography entries, and `false` otherwise.

* `has_pdf_parsed_ref_entries`: a `bool`-valued field that is `true` if the paper's PDF parse contains parsed reference entries (e.g. tables, figures), and `false` otherwise.

Please note:

* If `has_pdf_parse = false`, the other four fields will not be present in the JSON (trivially `false`).

* If `has_pdf_parse = true` but `has_pdf_parsed_abstract`, `has_pdf_parsed_body_text`, or `has_pdf_parsed_ref_entries` are `false`, this can be because:

    * Our PDF parser failed to extract that element
    * Our PDF parser succeeded but that paper simply did not have that element (e.g. papers without abstracts)
    * Our PDF parser succeeded but that element was removed because the paper is not identified as open-access.  
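
Since the four `has_pdf_parsed_*` keys are absent when `has_pdf_parse` is `false`, indexing them directly would raise a `KeyError`. A minimal sketch of a safe filter, assuming each metadata record is a plain `dict` as produced by `json.loads` (the function names are ours):

```python
def has_usable_body_text(meta):
    """True only if the paper has a PDF parse that contains parsed body text."""
    # `.get` treats the missing keys as False, matching the "trivially false" note above.
    return bool(meta.get("has_pdf_parse")) and bool(meta.get("has_pdf_parsed_body_text", False))


def select_full_text_ids(metadata_records):
    """Return the `paper_id` of every record whose full text can be used."""
    return [meta["paper_id"] for meta in metadata_records if has_usable_body_text(meta)]
```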


##### metadata_CLASS
```python
{
 "paper_id": (string), 
 "title": (string), 
 "authors": [
     {
         "first": (string), 
         "middle": [], 
         "last": (string), 
         "suffix": (string)
     },
     ...
   ]: **Author_Class**, 
 "abstract": (string), 
 "year": (int), 
 "arxiv_id": null, 
 "acl_id": null, 
 "pmc_id": null, 
 "pubmed_id": null, 
 "doi": null, 
 "venue": null, 
 "journal": (string), 
 "mag_id": (string-number), 
 "mag_field_of_study": [
     "Medicine",
     "Computer Science"
   ]: **FieldOfStudy_Enum**, 
 "outbound_citations": [], 
 "inbound_citations": [], 
 "has_outbound_citations": false, 
 "has_inbound_citations": false, 
 "has_pdf_parse": false, 
 "s2_url": (string)
}
```

Here I represent `Author_Class` as an object of 
```python
{
    "first": (string), 
    "middle": [], 
    "last": (string), 
    "suffix": (string)
}
```
and `FieldOfStudy_Enum` as an Enum of string such as `[ "Medicine", "Computer Science", "Physics", "Mathematics", ... ]`
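
To check how the papers of a chunk are distributed over the fields of study, one can count the values of `mag_field_of_study`. The sketch below assumes the field is either a list of strings or `null`; papers without a field are counted under `"n/a"`, mirroring the table above. The function name is ours.

```python
from collections import Counter


def count_fields_of_study(metadata_records):
    """Count how many papers mention each MAG field of study."""
    counts = Counter()
    for meta in metadata_records:
        # A missing or empty list is counted once under "n/a".
        fields = meta.get("mag_field_of_study") or ["n/a"]
        counts.update(fields)
    return counts
```

Note that a paper can list more than one field, so the counts can sum to more than the number of papers.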


## `pdf_parses` schema

We view `pdf_parses/` as supplementary to the `metadata/` entries.  PDF parses are also represented as JSONlines file (one line per paper) with the following keys:

* `paper_id`: a `str`-valued field which is the same S2ORC paper ID in `metadata/`

* `_pdf_hash`: a `str`-valued field.  Internal usage only.  We use this for debugging.

* `abstract` and `body_text` are `List[Dict]`-valued fields representing parsed text from the PDF.  Each `Dict` corresponds to a paragraph.  `List` preserves their original ordering.

* `bib_entries` and `ref_entries` are `Dict`-valued fields representing extracted entities that can be referenced (inline) within the text.

#### example 1

One example paragraph in `abstract` or `body_text` might look like:

```python
{
    "section": "Introduction",
    "text": "Dogs are happier cats [13, 15]. See Figure 3 for a diagram.",
    "cite_spans": [
        {"start": 22, "end": 25, "text": "[13", "ref_id": "BIBREF11"},
        {"start": 27, "end": 30, "text": "15]", "ref_id": "BIBREF30"},
        ...
    ],
    "ref_spans": [
        {"start": 36, "end": 44, "text": "Figure 3", "ref_id": "FIGREF2"},
    ]
}
```

and example `bib_entries` and `ref_entries` might look like:

```python
{
    ...,
    "BIBREF11": {
        "title": "Do dogs dream of electric humans?",
        "authors": [
            {"first": "Lucy", "middle": ["Lu"], "last": "Wang", "suffix": ""}, 
            {"first": "Mark", "middle": [], "last": "Neumann", "suffix": "V"}
        ],
        "year": "", 
        "venue": "barXiv",
        "link": null
    },
    ...
}
```

```python
{
    "TABREF4": {
        "text": "Table 5. Clearly, we achieve SOTA here or something.",
        "type": "table"
    }
    ...,
    "FIGREF2": {
        "text": "Figure 3. This is the caption of a pretty figure.",
        "type": "figure"
    },
    ...
}
```

Notice: 

* Inline `spans` are represented by character start and end indices into the paragraph `text`
* `spans` resolve to `BIBREF`, `TABREF` or `FIGREF` entries.
* `BIBREF` are IDs of bibliographic elements of `bib_entries`.  Bib entries may be missing fields (e.g. `year`).  They can be linked to S2ORC papers, as specified by `link`, but we also preserve any unlinked entries by setting `link` to `null`.
* `FIGREF` and `TABREF` are IDs of figure and table elements of `ref_entries`.  Ref entries contain the caption text of the corresponding object, and also indicate the type of object.
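
The span mechanics described above can be sketched as follows: each span is sliced out of the paragraph `text` with its `start`/`end` offsets, and its `ref_id` is looked up in `bib_entries` (for `cite_spans`) or in `ref_entries` (for `ref_spans`). The function name `resolve_spans` is ours, and the data is the example paragraph from above with the elided (`...`) entries left out.

```python
def resolve_spans(paragraph, bib_entries, ref_entries):
    """Return one dict per inline span: the mention text, its ref_id and the resolved entry."""
    text = paragraph["text"]
    resolved = []
    for kind, entries in (("cite_spans", bib_entries), ("ref_spans", ref_entries)):
        for span in paragraph.get(kind, []):
            ref_id = span.get("ref_id")
            resolved.append({
                "mention": text[span["start"]:span["end"]],  # character offsets into `text`
                "ref_id": ref_id,
                "entry": entries.get(ref_id),  # None when the id is not in the dictionary
            })
    return resolved


example_paragraph = {
    "section": "Introduction",
    "text": "Dogs are happier cats [13, 15]. See Figure 3 for a diagram.",
    "cite_spans": [
        {"start": 22, "end": 25, "text": "[13", "ref_id": "BIBREF11"},
        {"start": 27, "end": 30, "text": "15]", "ref_id": "BIBREF30"},
    ],
    "ref_spans": [
        {"start": 36, "end": 44, "text": "Figure 3", "ref_id": "FIGREF2"},
    ],
}
example_bib_entries = {"BIBREF11": {"title": "Do dogs dream of electric humans?", "link": None}}
example_ref_entries = {"FIGREF2": {"text": "Figure 3. This is the caption of a pretty figure.", "type": "figure"}}

resolved_spans = resolve_spans(example_paragraph, example_bib_entries, example_ref_entries)
```

Here `BIBREF30` is not in the (truncated) example dictionary, so its `entry` is `None`; the same check is useful on real data, where a span may fail to resolve.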


#### example 2

You may see empty `pdf_parses/` JSONs that look like: 

```python
{
    "paper_id": "...", 
    "_pdf_hash": "...", 
    "abstract": [], 
    "body_text": [], 
    "bib_entries": {}, 
    "ref_entries": {}
}
```

We keep these around for our internal usage, but the way to interpret these is that there is no usable PDF parse here, despite the corresponding `metadata/` entry still displaying `has_pdf_parse = true`.

These exist when (i) `bib_entries` does not successfully parse *and* (ii) the paper is not open-access, so we had to remove `abstract`, `body_text`, and `ref_entries`.   
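
Because `has_pdf_parse = true` does not guarantee usable content, it is worth checking the parse itself before using it. A minimal sketch, assuming the parse is a `dict` with the four keys shown above (the function name is ours):

```python
def is_empty_pdf_parse(pdf_parse):
    """True when abstract, body_text, bib_entries and ref_entries are all empty or missing."""
    content_keys = ("abstract", "body_text", "bib_entries", "ref_entries")
    return not any(pdf_parse.get(key) for key in content_keys)
```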



##### pdf_parses_CLASS
```python
{
 "paper_id": (string), 
 "_pdf_hash": (string-number), 
 "abstract": [
     {
         "section": (string) "Abstract", 
         "text": (string), 
         "cite_spans": [
             {
                 "start": (int), 
                 "end": (int), 
                 "text": (string-number) "[4, 
                 "ref_id": (string)
             }
           ]: **CiteSpan_Class**, 
         "ref_spans": []
     },
     ...
 ]: **TextSection_Class**, 
 "body_text": [], 
 "bib_entries": 
     {
         "BIBREF0": 
             {
              "title": (string), 
              "authors": [
                  {
                      "first": (string), 
                      "middle": [], 
                      "last": (string), 
                      "suffix": (string)
                   }
                 ], 
               "year": (int), 
               "venue": (string), 
               "link": (string-number)
              }, 
          "BIBREF1": 
              {
                  ...
              }
       }: **BIBREF_Class**, 
 "ref_entries": {}
}
```

Here I represent `TextSection_Class` as an object of 
```python
{
 "section": (string), 
 "text": (string), 
 "cite_spans": [
     {
         "start": (int), 
         "end": (int), 
         "text": (string-number) "[4, 
         "ref_id": (string)
     }
   ], 
 "ref_spans": []
}
```
where `CiteSpan_Class` itself is another structured object:
```python
{
 "start": (int), 
 "end": (int), 
 "text": (string-number), 
 "ref_id": (string)
}
```
and `BIBREF_Class` as a dictionary with `BIBREF_#` as key, each mapped to an object as follows:
```python
"BIBREF_#": 
 {
  "title": (string), 
  "authors": [
      {
          "first": (string), 
          "middle": [] (list of string),
          "last": (string), 
          "suffix": (string)
       }
     ], 
   "year": (int), 
   "venue": (string), 
   "link": null
  }
```

## 2.2 Creation (s2orc)
---
Now that we have explored the `S2ORC` structure, we are ready to load the data (starting from the `sample` folder and moving on to the `full` one). The first thing to do is to create (as we did before) a function that reads the JSON Lines files: `json_s2orc_chunk_read`.

In [None]:
TYPE = "sample" # "full"
SAMPLE_FOLDER = f"s2orc-{TYPE}-20200705v1/{TYPE}"

# sample data
sample_file_names = [
    "sample.jsonl"
]

# full data
N = 1
full_file_names = [ f"metadata_{index}.jsonl" for index in range(0,N) ]

print('sample', sample_file_names)
print('full', full_file_names)

In [None]:
# choosing the file_names to work with
if TYPE == "sample": 
    file_names = sample_file_names
else: 
    file_names = full_file_names
    
print(file_names)

Let's see what's inside the folders (here `metadata`, but `pdf_parses` should look the same):

In [None]:
DATASET_PATH = f"{DATA_PATH}/{SAMPLE_FOLDER}"

In [None]:
path_metadata = f"{DATA_PATH}/{SAMPLE_FOLDER}/metadata/"
path_pdf_parses = f"{DATA_PATH}/{SAMPLE_FOLDER}/pdf_parses/"

metadata_output = !ls $path_metadata | grep ".*\.jsonl$"
print('metadata:', metadata_output)

pdf_parses_output = !ls $path_pdf_parses | grep ".*\.jsonl$"
print('pdf_parses', pdf_parses_output)

We can now describe the function in charge of loading the `jsonl` files. It takes as input the dataset path (`f"{DATA_PATH}/{SAMPLE_FOLDER}"`) and then looks in `metadata` and `pdf_parses` for the files listed in `file_names`.

(Unused)
```python
def read_json_list(jsonl_path):
    json_list_of_dict = []
    with open(jsonl_path, 'r') as input_json:
        for json_line in input_json:
            json_dict = json.loads(json_line)
            json_list_of_dict.append(json_dict)
    return json_list_of_dict
```

In [None]:
def read_json_list_dict(jsonl_path):
    # list of dictionaries, one for each row in pdf_parses
    json_list_of_dict = []
    # dictionary of indexes, to obtain the object from list, starting from the `paper_id`
    json_dict_of_index = {}
    with open(jsonl_path, 'r') as input_json:
        for index, json_line in enumerate(input_json):
            json_dict = json.loads(json_line)
            # append the dictionary to the dictionaries' list
            json_list_of_dict.append(json_dict)
            # insert (paper_id, index) pair as (key, value) to the dictionary
            json_dict_of_index[json_dict['paper_id']] = index
    return json_list_of_dict, json_dict_of_index
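
The function above keeps the whole chunk in memory, which is fine for `sample.jsonl` but may be heavy for a chunk of the `full` release. As an alternative sketch (not used in the rest of this notebook), a generator can yield one record at a time; the name `iter_jsonl` is ours.

```python
import json


def iter_jsonl(jsonl_path):
    """Yield one parsed JSON object per non-empty line, without loading the whole file."""
    with open(jsonl_path, "r", encoding="utf-8") as input_json:
        for line in input_json:
            if line.strip():  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

It can be used as `for paper in iter_jsonl(path): ...`, for instance to filter papers before storing them in a list.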

In [None]:
def json_s2orc_chunk_read( dataset_path, file_name ):
    """
    Args:
        dataset_path (string): Path to the dataset directory (e.g. '../data/s2orc-sample-20200705v1/sample').
        file_name (string): Name of the file to read (e.g. 'sample.jsonl').

    Return:
        json_dict_of_list (dict): Dictionary such as:
                { 'metadata': [...], 'pdf_parses': [...], 'meta_key_idx': {...}, 'pdf_key_idx': {...} }
            where the lists hold objects of type metadata_CLASS and pdf_parses_CLASS respectively,
            and the index dictionaries map each `paper_id` to its position in the matching list.
    """
    if verbose: print("[INFO-START] Chunk read: ", file_name)

    json_dict_of_list = {'metadata': [], 'pdf_parses': [], 'meta_key_idx': {}, 'pdf_key_idx': {}}
    
    if verbose: print("[INFO] Metadata read: ", file_name)
    jsonl_path_metadata = os.path.join(dataset_path, 'metadata', file_name)
    json_list_metadata, json_dict_of_index_meta = read_json_list_dict(jsonl_path_metadata)
    json_dict_of_list['metadata'] = json_list_metadata
    json_dict_of_list['meta_key_idx'] = json_dict_of_index_meta
    
    if verbose: print("[INFO] Pdf_Parses read: ", file_name)
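    # full chunks are named `metadata_{id}.jsonl` / `pdf_parses_{id}.jsonl`; `sample.jsonl` is left unchanged
    jsonl_path_pdf_parses = os.path.join(dataset_path, 'pdf_parses', file_name.replace('metadata_', 'pdf_parses_'))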
    json_list_pdf_parses, json_dict_of_index_pdf = read_json_list_dict(jsonl_path_pdf_parses)
    json_dict_of_list['pdf_parses'] = json_list_pdf_parses
    json_dict_of_list['pdf_key_idx'] = json_dict_of_index_pdf
    
    if verbose: print("[INFO-END  ] Chunk read: ", file_name)                 
    return json_dict_of_list

In [None]:
def json_s2orc_multichunk_read( dataset_path, file_names ):
    """
    Args:
        dataset_path (string): Path to the dataset directory (e.g. '{data}/s2orc-sample-20200705v1/sample').
        file_names (list of string): List of filenames (e.g. ['sample_0.jsonl', 'sample_1.jsonl'])
            present in `{dataset_path}/metadata` and `{dataset_path}/pdf_parses`.

    Return:
        multichunks_lists (list of dict): One dictionary per file, as returned by `json_s2orc_chunk_read`.
    """
    if verbose: print("[INFO-START] Multichunk read")
    
    multichunks_lists = []
    for file_name in file_names:

        chunk_list = json_s2orc_chunk_read( dataset_path, file_name )
        multichunks_lists.append(chunk_list)
    
    return multichunks_lists

In [None]:
print(DATASET_PATH)
print(file_names)

In [None]:
multichunks_lists = json_s2orc_multichunk_read( DATASET_PATH,file_names )

We have used only the `sample.jsonl` or the pair (`metadata_0.jsonl`-`pdf_parses_0.jsonl`) so we just have one element in the `multichunks_lists`. 

In [None]:
len(multichunks_lists)

We have parsed all the `metadata` and `pdf_parses` elements, so we now have a dictionary composed of:
```python
json_dict_of_list = {
    'metadata': [], 
    'pdf_parses': [], 
    'meta_key_idx': {}, 
    'pdf_key_idx': {}
}
```
In this dictionary we see:
* metadata - `List[dict]` of type `metadata`.
* pdf_parses - `List[dict]` of type `pdf_parses`.
* meta_key_idx - `dict` with keys: `paper_id` and values: `index` in the metadata list.
* pdf_key_idx - `dict` with keys: `paper_id` and values: `index` in the pdf_parses list.

In [None]:
index = multichunks_lists[0]['meta_key_idx']['77499681']
multichunks_lists[0]['metadata'][index]

In [None]:
index = multichunks_lists[0]['pdf_key_idx']['77499681']
print(multichunks_lists[0]['pdf_parses'][index]['paper_id'])
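
The two lookups above can be wrapped in a small helper that returns both records for a given `paper_id`. This is a sketch (the name `get_paper` is ours) that uses `.get` so that a paper missing from one of the two files yields `None` instead of raising a `KeyError`.

```python
def get_paper(chunk, paper_id):
    """Return (metadata_record, pdf_parse_record) for `paper_id`; None where not found."""
    meta_idx = chunk["meta_key_idx"].get(paper_id)
    pdf_idx = chunk["pdf_key_idx"].get(paper_id)
    meta = chunk["metadata"][meta_idx] if meta_idx is not None else None
    pdf_parse = chunk["pdf_parses"][pdf_idx] if pdf_idx is not None else None
    return meta, pdf_parse
```

For example, `meta, pdf_parse = get_paper(multichunks_lists[0], '77499681')`.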

## 2.3 Title Abstract - Full text  (s2orc)
---
We have loaded the `S2ORC` dataset and parsed one chunk of it; we now want to start creating our dataset objects (Classes and Loaders).

Let's start with the datasets.

### Dataset creation
We want to create the datasets for papers' title-abstract and fulltext-(title-abstract) generation. 
> We'd also like to create a KeyPhrase dataset; we are currently waiting for a response from the `S2ORC` authors to understand where we could obtain the keyphrases/keywords.

In order to do this, we want to create the two datasets (saving them as `jsonl` files).
We can organize the data folder as follows:
```bash
- data/
    # keyphrase dataset 
    - keyphrase/
        # (title - abstract - fulltext - keyphrase)
        - s2orc/
            - README.md
            - chuncks_dataset_idx.json
            - train/
                - train_0.jsonl
                - train_1.jsonl
                - ...
            - test/
                - test_0.jsonl
                - test_1.jsonl
                - ...
            - val/
                - val_0.jsonl
                - val_1.jsonl
                - ...
    
    # sts datasets
    - sts/ 
        # (title - abstract - cosine_similarity)
        - s2orc_partial/
            - README.md
            - chuncks_dataset_idx.json
            - train/
                - train_0.jsonl
                - train_1.jsonl
                - ...
            - test/
                - test_0.jsonl
                - test_1.jsonl
                - ...
            - val/
                - val_0.jsonl
                - val_1.jsonl
                - ...
                
        # (title - abstract - fulltext - cosine_similarity)
        - s2orc_full/
            - README.md
            - chuncks_dataset_idx.json
            - train/
                - train_0.jsonl
                - train_1.jsonl
                - ...
            - test/
                - test_0.jsonl
                - test_1.jsonl
                - ...
            - val/
                - val_0.jsonl
                - val_1.jsonl
                - ...
```
and `chuncks_dataset_idx.json` holds the dictionary that maps the chunks (`metadata_{id}.jsonl, pdf_parses_{id}.jsonl for id in range(99)`) to the `{train|test|validation}_{id}` files.

A first step towards no longer using the chunks (neither metadata nor full text) is to summarize the data we want into a new Python structure (a dict) as follows, and save it:

```python
{
    "paper_id": (string-int), 
    "title":  (string),
    "abstract": (string), 
    "fulltext": (string), 
    "keywords": List[string],
}
```
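
A minimal sketch of how such a record could be built from one metadata entry and its (optional) PDF parse. The function names are ours; joining the body paragraphs with newlines is an assumption, and `keywords` is left empty because we do not yet know where to obtain keyphrases for `S2ORC`.

```python
def paragraphs_to_text(paragraphs):
    """Join the `text` of each parsed paragraph, keeping the original order."""
    return "\n".join(p["text"] for p in (paragraphs or []) if p.get("text"))


def summarize_paper(meta, pdf_parse=None):
    """Build the flat record {paper_id, title, abstract, fulltext, keywords}."""
    abstract = meta.get("abstract")
    if not abstract and pdf_parse:
        # fall back to the abstract parsed from the PDF when the metadata has none
        abstract = paragraphs_to_text(pdf_parse.get("abstract"))
    fulltext = paragraphs_to_text(pdf_parse.get("body_text")) if pdf_parse else ""
    return {
        "paper_id": meta["paper_id"],
        "title": meta.get("title"),
        "abstract": abstract or "",
        "fulltext": fulltext,
        "keywords": [],  # placeholder: keyphrases are not available in S2ORC for now
    }
```

Each record can then be written as one line of a `jsonl` file with `json.dumps(record)`.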

1. Get the training/validation dataset by extracting title-abstract pairs from the `S2ORC` dataset, and get the testing data from the `KeyPhrase` (*'inspec', 'krapivin', 'nus', 'semeval', 'kp20k', 'duc', 'stackexchange'*) datasets. We should have a pair of sentences (typically a *title* and an *abstract*), possibly with *fulltext* and *keywords* fields, which can be:

    - completely related (an abstract and its corresponding title)
    - somewhat related (an abstract and a title related by field and keyphrase, e.g. {cs+(deep learning; metric learning; nlp; sts;)})
    - unrelated but not far apart (an abstract and a title related by field but **not** by keyphrase, e.g. {cs+(nlp; transformer;)-vs-(cv; attention)})
    - completely unrelated (abstract and title are unrelated in both field and keyphrase, e.g. {cs+a -vs- phy+z})



2. **🤗transformers**, we can see [here](https://huggingface.co/docs/datasets/loading_datasets.html#json-files) that the dataset loader (from `jsonl` files) can be used to load train/validation datasets. As we have already loaded the dataset as a dictionary (it is called `multichunks_lists` for now, depending on how many chunks we need to load in one shot), we could also use the example [here](https://huggingface.co/docs/datasets/loading_datasets.html#from-a-python-dictionary) to load the dataset from an existing dictionary. 


3. **sentence-transformer**, [sBERT example for train](https://www.sbert.net/docs/training/overview.html#loss-functions) 
