# Datasets Francesco
---

In this notebook we'll implement the Dataset classes we need to work with all the datasets we have.
First we introduce the datasets, then we group them by intended use, and finally we build the classes that manage those datasets and tasks.

# 0.0 Utils
---

We will be using the 🤗*Datasets* library, so first we need to install it.

> !pip install datasets > datasets_installation.txt

We'll also be using the 🤗 *Transformers* library, as we need a tokenizer and a vocab.

> !pip install transformers > transformers_installation.txt

Let's define all the `imports` and `hyperparameters` in one place.

In [15]:
# Imports
import time
import os
import json
from transformers import AutoTokenizer, AutoModel, PreTrainedTokenizer, DataCollatorForLanguageModeling, Trainer, BertForMaskedLM, TrainingArguments
import torch
import transformers
from torch.utils.data import Dataset, DataLoader
from typing import Dict, List, Union
from datasets import load_dataset, DatasetDict
from enum import Enum
# from pydash.arrays import pull_at
import numpy as np
from torch.nn.utils.rnn import pad_sequence
import math

In [16]:
# ----------------------------------- #
#           Hyperparameters
# ----------------------------------- #
RUN_NAME = 'scibert-s2orc'
RUN_NUMBER = 2
RUN_ITER = 1


# --------- logging         --------- #
verbose = True # logging function description start and end
debug = False # logging element values
print_debug = False
time_debug = True

# to silence info-level messages from huggingface-transformers, comment out this line
transformers.logging.set_verbosity_info()

# --------- preprocessing   --------- #
# in **partial_prepare_data**
remove_None_papers = True # if True, remove papers with None either in abstract or title
remove_Unused_columns = True
# in **preprocess**
clean_None_data = False # if True, changes all the None (abstract or title) to ''
remove_None_data = False # if True (and clean_None_data is False), remove every None abstract/title together with the corresponding title/abstract

# --------- paths           --------- #
# data folder path
data_base_dir = '/home/vivoli/Thesis/data'
# If you choose 'full' without setting N, the system will use the first chunk
s2orc_type = 'full'
N = None

# --------- model/tokenizer --------- #
# huggingface model/tokenizer name
MODEL_PATH = 'allenai/scibert_scivocab_uncased'
model_name_or_path = MODEL_PATH

# --------- checkpoint model -------- #
from transformers.trainer_utils import get_last_checkpoint

# seed for reproducibility of experiments
SEED = 1234

# 0.1 KeyPhrase Dataset
---

These are testing datasets

- Keyphrase paths

```python
dataset_names = ['inspec', 'krapivin', 'nus', 'semeval', 'kp20k', 'duc', 'stackexchange']
json_base_dir = data_base_dir + '/keyphrase/json/'
```

- Keyphrase read json file

```python
def json_keyphrase_read( dataset_name, json_base_dir, file_name=None ):
    """
    Args:
        dataset_name (string): Directory name with the json file.
        json_base_dir (string): Path to the Dataset directory.
        file_name (string): (Optional) Json file name.
        
    Return:
        json_list (list of dict): List of dictionaries, each one with the fields 
            - 'title' (string)
            - 'abstract' (string)
            - 'fulltext' (string | '')
            - 'keywords' (list)
    """
    if verbose: print(dataset_name)

    input_json_path = os.path.join(json_base_dir, dataset_name, 
                                   '%s_test.json' % dataset_name if file_name is None else file_name)

    json_list_of_dict = []
    with open(input_json_path, 'r') as input_json:
        for json_line in input_json:
            json_dict = json.loads(json_line)

            if dataset_name == 'stackexchange':
                json_dict['abstract'] = json_dict['question']
                json_dict['keywords'] = json_dict['tags']            
                del json_dict['question']
                del json_dict['tags']

            keywords = json_dict['keywords']

            if isinstance(keywords, str):
                keywords = keywords.split(';')
                json_dict['keywords'] = keywords

            json_list_of_dict.append(json_dict)
                
    return json_list_of_dict
```
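
As a quick illustration of the per-record normalization performed above, the sketch below applies the same two rules (renaming `question`/`tags` for `stackexchange`, splitting `;`-separated keywords) to in-memory JSON lines. The helper name `normalize_keyphrase_record` and the sample records are hypothetical and only serve the example.

```python
import json


def normalize_keyphrase_record(json_dict, dataset_name):
    """Return a copy of the record with 'abstract' and 'keywords' normalized."""
    record = dict(json_dict)
    if dataset_name == 'stackexchange':
        # stackexchange stores the text under 'question' and the labels under 'tags'
        record['abstract'] = record.pop('question')
        record['keywords'] = record.pop('tags')
    if isinstance(record['keywords'], str):
        # some datasets store the keywords as a single ';'-separated string
        record['keywords'] = record['keywords'].split(';')
    return record


# Hypothetical sample lines, one JSON object per line as in the *_test.json files
sample_lines = [
    '{"title": "t1", "abstract": "a1", "fulltext": "", "keywords": "graph;tree"}',
    '{"title": "t2", "question": "q2", "tags": "python;json"}',
]
normalized = [
    normalize_keyphrase_record(json.loads(sample_lines[0]), 'inspec'),
    normalize_keyphrase_record(json.loads(sample_lines[1]), 'stackexchange'),
]
print(normalized)
```

Working on a copy keeps the parsed record untouched, which can make debugging a malformed line a little easier.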

These keyphrase datasets could be useful for testing a model on the keyphrase task or on abstract-title summarization/generation/embedding.

For now, we can avoid implementing the Dataset and DataLoader classes for these objects.

Still, the Dataset and DataLoader would be as simple as follows:

- Keyphrase process data -> tensor

```python
def data_keyphrase_process(json_list_of_dict, tokenizer, debug=False):
    data = []
    for json_dict in json_list_of_dict:
        title_tensor_ = torch.tensor(tokenizer.encode(json_dict['title']),
                                dtype=torch.long)
        if debug: print(title_tensor_)
        abstract_tensor_ = torch.tensor(tokenizer.encode(json_dict['abstract']),
                                dtype=torch.long)
        if debug: print(abstract_tensor_)
        fulltext_tensor_ = torch.tensor(tokenizer.encode(json_dict['fulltext']),
                                dtype=torch.long)
        if debug: print(fulltext_tensor_)
        keywords_tensor_ = torch.tensor(tokenizer(json_dict['keywords'], padding=True)['input_ids'], 
                                dtype=torch.long)
        if debug: print(keywords_tensor_)

        data.append((title_tensor_, abstract_tensor_, fulltext_tensor_, keywords_tensor_))
    return data
```

- Defining model and vocab

```python
# we need the `vocab` and the `tokenizer`; both come with *AutoTokenizer*
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModel.from_pretrained(MODEL_PATH)
```

- Extracting tokenized data
```python
# now we can use them
json_list_of_dict = json_keyphrase_read( dataset_names[0], json_base_dir )
data = data_keyphrase_process( json_list_of_dict, tokenizer )
```

The `data` object is composed of 500 tuples, each one made of 4 tensors of integer token IDs (these are not embeddings yet):
- `title_tensor_` holds the token IDs of the title
- `abstract_tensor_` holds the token IDs of the abstract
- `fulltext_tensor_` holds the token IDs of the fulltext
- `keywords_tensor_` holds the token IDs of the keywords (one row per keyword, padded to the longest keyword)

# 0.2 S2ORC Dataset
---

To build a generic loading function we take inspiration from [here](https://discuss.huggingface.co/t/pipeline-with-custom-dataset-tokenizer-when-to-save-load-manually/1084/11).

In [17]:
if s2orc_type == 'sample':
    metadata_filename = 'sample'
    pdf_parses_filename = 'sample'
    
    
elif s2orc_type == 'full':
    
    if N is None:
        print(f"[WARNING] You set 'full' but no bucket index was specified. \n \
        We'll use the index 0, so the first bucket will be used.")
        N = 0
        
    metadata_filename = f"metadata_{N}"
    pdf_parses_filename = f"pdf_parses_{N}"
    
else:
    raise NameError(f"You must select an existing S2ORC dataset \n \
                You selected {s2orc_type}, but options are ['sample' or 'full']")

meta_s2orc = data_base_dir +f'/s2orc-{s2orc_type}-20200705v1/{s2orc_type}/metadata/{metadata_filename}.jsonl'
pdfs_s2orc = data_base_dir +f'/s2orc-{s2orc_type}-20200705v1/{s2orc_type}/pdf_parses/{pdf_parses_filename}.jsonl'

         We'll use the index 0, so the first bucket will be used.
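
The same path logic could be wrapped in a small function, which makes it easier to reuse and to test. This is only a sketch: the function name `s2orc_paths` is hypothetical, and the folder names `metadata` and `pdf_parses` are taken from the directory listing shown later in this notebook.

```python
from pathlib import Path


def s2orc_paths(base_dir, s2orc_type='full', n=None):
    """Return (metadata_path, pdf_parses_path) for the chosen S2ORC flavour."""
    if s2orc_type == 'sample':
        meta_name = pdf_name = 'sample'
    elif s2orc_type == 'full':
        # without an explicit bucket index, fall back to the first bucket
        n = 0 if n is None else n
        meta_name, pdf_name = f'metadata_{n}', f'pdf_parses_{n}'
    else:
        raise ValueError(f"Unknown S2ORC type {s2orc_type!r}, options are 'sample' or 'full'")

    root = Path(base_dir) / f's2orc-{s2orc_type}-20200705v1' / s2orc_type
    return root / 'metadata' / f'{meta_name}.jsonl', root / 'pdf_parses' / f'{pdf_name}.jsonl'


meta_path, pdfs_path = s2orc_paths('/data', 'full')
print(meta_path.as_posix())
print(pdfs_path.as_posix())
```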


---
---
## ❌ PARTIAL PREPARE
---
---

In [94]:
# set output_dir, in case it is needed
output_dir=f'./tmp_trainer/#{RUN_NUMBER}_{RUN_ITER}_{RUN_NAME}'

# tokenizer from 'allenai/scibert_scivocab_uncased' or from checkpoint
PRETRAINED = MODEL_PATH
print(f"Checkpoint DOESN'T exist, model load from scratch at {PRETRAINED}.")
    
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED) 
# model = BertForMaskedLM.from_pretrained(PRETRAINED) 

Checkpoint DOESN'T exist, model load from scratch at allenai/scibert_scivocab_uncased.


loading configuration file https://huggingface.co/allenai/scibert_scivocab_uncased/resolve/main/config.json from cache at /home/vivoli/.cache/huggingface/transformers/858852fd2471ce39075378592ddc87f5a6551e64c6825d1b92c8dab9318e0fc3.03ff9e9f998b9a9d40647a2148a202e3fb3d568dc0f170dda9dda194bab4d5dd
Model config BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.3.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 31090
}

Model name 'allenai/scibert_scivocab_uncased' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, be


The maximum sequence length of the tokenizer, of the model and of the input might differ:

In [19]:
max_seq_length = model.config.max_position_embeddings
print(max_seq_length)

512
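
One possible way to reconcile these limits is to take the smallest meaningful one. The sketch below assumes that a tokenizer without a configured limit reports the sentinel `int(1e30)`, which matches the `model_max_len=1000000000000000019884624838656` shown in the tokenizer repr later in this notebook. The function name `resolve_max_seq_length` is hypothetical.

```python
# Sentinel reported by a tokenizer that has no configured maximum length
VERY_LARGE_INTEGER = int(1e30)


def resolve_max_seq_length(tokenizer_max_length, model_max_positions, requested=None):
    """Pick the smallest usable limit among tokenizer, model and user request."""
    limits = [model_max_positions]
    if tokenizer_max_length < VERY_LARGE_INTEGER:
        # only trust the tokenizer limit when it is a real value
        limits.append(tokenizer_max_length)
    if requested is not None:
        limits.append(requested)
    return min(limits)


print(resolve_max_seq_length(VERY_LARGE_INTEGER, 512))
print(resolve_max_seq_length(VERY_LARGE_INTEGER, 512, requested=128))
```

With the values of this notebook (no tokenizer limit, 512 positions in the model) the first call falls back to the model limit.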


Here we use only `meta_s2orc`, so the data is structured as follows:

## 2.1 Description (s2orc)
---
The `S2ORC` dataset is in the `data` path under the folder `s2orc-full-20200705v1` (where `s2orc` is the name of the dataset, `full` is the type, as there is also a `sample` version; and `20200705v1` is the version).
We can reach it by leaving the project folder and entering the data folder:

In [21]:
DATA_PATH = '/home/vivoli/Thesis/data' 
!ls $DATA_PATH

keyphrase  s2orc-full-20200705v1  s2orc-sample-20200705v1  sentence-tranformers
README.md  s2orc-mini		  scibert		   snli_1.0


In [22]:
custom_path = f"{DATA_PATH}/s2orc-full-20200705v1/full"
!ls $custom_path

metadata  pdf_parses


As you can see (going into `s2orc-full-20200705v1/full/`) there are the `metadata` folder and the `pdf_parses` folder. The main difference (as the names already suggest) is that `metadata` only holds some information about each paper (retrieved from the published metadata), while `pdf_parses` holds the full content of the paper (if the paper was available, was correctly parsed, and no license restriction applied to its data). The `title` of the paper is contained only in the `metadata` file, but it can be retrieved from there through the `paper_id` field of the parsed paper.

More information about the `S2ORC` dataset can be read in the [README.md](https://github.com/allenai/s2orc/blob/master/README.md) of the project and in the [project repository](https://github.com/allenai/s2orc/)
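
Recovering the title of a parsed paper then amounts to a lookup by `paper_id`. The sketch below uses hypothetical in-memory JSON lines instead of the real files; only `paper_id` and `title` are relied upon, and the `body_text` field is included purely as a placeholder.

```python
import json

# Hypothetical JSONL content, standing in for metadata_N.jsonl and pdf_parses_N.jsonl
metadata_lines = [
    '{"paper_id": "1", "title": "A title", "has_pdf_parse": true}',
    '{"paper_id": "2", "title": "Another title", "has_pdf_parse": false}',
]
pdf_parse_lines = [
    '{"paper_id": "1", "body_text": []}',
]

# Index the titles by paper_id (one pass over the metadata file)
titles_by_id = {}
for line in metadata_lines:
    record = json.loads(line)
    titles_by_id[record["paper_id"]] = record["title"]

# Attach the title to each parsed paper (None when the id is unknown)
parses_with_title = []
for line in pdf_parse_lines:
    parse = json.loads(line)
    parse["title"] = titles_by_id.get(parse["paper_id"])
    parses_with_title.append(parse)

print(parses_with_title)
```

For the full dataset the index would have to be built per bucket, since holding every title in memory at once may not be practical.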

### mag field
- MAG fields of study:

| class | Field of study | All papers | Full text |
|-------|----------------|------------|-----------|
|0      | Medicine       | 12.8M      | 1.8M      |
|1      | Biology        | 9.6M       | 1.6M      |
|2      | Chemistry      | 8.7M       | 484k      |
|3      | n/a            | 7.7M       | 583k      |
|4      | Engineering    | 6.3M       | 228k      |
|5      | Comp Sci       | 6.0M       | 580k      |
|6      | Physics        | 4.9M       | 838k      |
|7      | Mat Sci        | 4.6M       | 213k      |
|8      | Math           | 3.9M       | 669k      |
|9      | Psychology     | 3.4M       | 316k      |
|10     | Economics      | 2.3M       | 198k      |
|11     | Poli Sci       | 1.8M       | 69k       |
|12     | Business       | 1.8M       | 94k       |
|13     | Geology        | 1.8M       | 115k      |
|14     | Sociology      | 1.6M       | 93k       |
|15     | Geography      | 1.4M       | 58k       |
|16     | Env Sci        | 766k       | 52k       |
|17     | Art            | 700k       | 16k       |
|18     | History        | 690k       | 22k       |
|19     | Philosophy     | 384k       | 15k       |

## `metadata` schema

We recommend everyone work with `metadata/` as the starting point.  This is a JSONlines file (one line per paper) with the following keys:

#### Identifier fields

* `paper_id`: a `str`-valued field that is a unique identifier for each S2ORC paper.

* `arxiv_id`: a `str`-valued field for papers on [arXiv.org](https://arxiv.org).

* `acl_id`: a `str`-valued field for papers on [the ACL Anthology](https://www.aclweb.org/anthology/).

* `pmc_id`: a `str`-valued field for papers on [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/articles).

* `pubmed_id`: a `str`-valued field for papers on [PubMed](https://pubmed.ncbi.nlm.nih.gov/), which includes MEDLINE.  Also known as `pmid` on PubMed.

* `mag_id`: a `str`-valued field for papers on [Microsoft Academic](https://academic.microsoft.com).

* `doi`: a `str`-valued field for the [DOI](http://doi.org/).  

Notably:

* Resolved citation links are represented by the cited paper's `paper_id`.

* The `paper_id` resolves to a Semantic Scholar paper page, which can be verified using the `s2_url` field.

* We don't always have a value for every identifier field.  When missing, they take `null` value.


#### Metadata fields

* `title`: a `str`-valued field for the paper title.  Every S2ORC paper *must* have one, though the source can be from publishers or parsed from PDFs.  We prioritize publisher-provided values over parsed values.

* `authors`: a `List[Dict]`-valued field for the paper authors.  Authors are listed in order.  Each dictionary has the keys `first`, `middle`, `last`, and `suffix` for the author name, which are all `str`-valued with exception of `middle`, which is a `List[str]`-valued field.  Every S2ORC paper *must* have at least one author.

* `venue` and `journal`: `str`-valued fields for the published venue/journal.  *Please note that there is not often agreement as to what constitutes a "venue" versus a "journal". Consolidating these fields is being considered for future releases.*   

* `year`: an `int`-valued field for the published year.  If a paper is preprinted in 2019 but published in 2020, we try to ensure the `venue/journal` and `year` fields agree & prefer non-preprint published info. *We know this decision prohibits certain types of analysis like comparing preprint & published versions of a paper.  We're looking into it for future releases.*  

* `abstract`: a `str`-valued field for the abstract.  These are provided directly from gold sources (not parsed from PDFs).  We preserve newline breaks in structured abstracts, which are common in medical papers, by denoting breaks with `':::'`.     

* `inbound_citations`: a `List[str]`-valued field containing `paper_id` of other S2ORC papers that cite the current paper.  *Currently derived from PDF-parsed bibliographies, but may have gold sources in the future.*

* `outbound_citations`: a `List[str]`-valued field containing `paper_id` of other S2ORC papers that the current paper cites.  Same note as above.   

* `has_inbound_citations`: a `bool`-valued field that is `true` if `inbound_citations` has at least one entry, and `false` otherwise.

* `has_outbound_citations`: a `bool`-valued field that is `true` if `outbound_citations` has at least one entry, and `false` otherwise.

We don't always have a value for every metadata field.  When missing, `str` fields take `null` value, while `List` fields are empty lists.

#### PDF parse-related metadata fields

* `has_pdf_parse`:  a `bool`-valued field that is `true` if this paper has a corresponding entry in `pdf_parses/`, which means we had processed that paper's PDF(s) at some point.  The field is `false` otherwise.

* `has_pdf_parsed_abstract`: a `bool`-valued field that is `true` if the paper's PDF parse contains a parsed abstract, and `false` otherwise.   

* `has_pdf_parsed_body_text`: a `bool`-valued field that is `true` if the paper's PDF parse contains parsed body text, and `false` otherwise.

* `has_pdf_parsed_bib_entries`: a `bool`-valued field that is `true` if the paper's PDF parse contains parsed bibliography entries, and `false` otherwise.

* `has_pdf_parsed_ref_entries`: a `bool`-valued field that is `true` if the paper's PDF parse contains parsed reference entries (e.g. tables, figures), and `false` otherwise.

Please note:

* If `has_pdf_parse = false`, the other four fields will not be present in the JSON (trivially `false`).

* If `has_pdf_parse = true` but `has_pdf_parsed_abstract`, `has_pdf_parsed_body_text`, or `has_pdf_parsed_ref_entries` are `false`, this can be because:

    * Our PDF parser failed to extract that element
    * Our PDF parser succeeded but that paper simply did not have that element (e.g. papers without abstracts)
    * Our PDF parser succeeded but that element was removed because the paper is not identified as open-access.  
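
Since the four `has_pdf_parsed_*` keys are absent when `has_pdf_parse` is `false`, reading them with a default avoids a `KeyError`. A minimal sketch follows; the helper name `pdf_parse_flags` and the sample records are hypothetical.

```python
PDF_PARSE_FLAG_NAMES = [
    "has_pdf_parsed_abstract",
    "has_pdf_parsed_body_text",
    "has_pdf_parsed_bib_entries",
    "has_pdf_parsed_ref_entries",
]


def pdf_parse_flags(paper_metadata):
    """Return all five PDF-related flags, treating missing keys as False."""
    has_parse = bool(paper_metadata.get("has_pdf_parse", False))
    flags = {"has_pdf_parse": has_parse}
    for name in PDF_PARSE_FLAG_NAMES:
        # a missing key is "trivially false", as the notes above state
        flags[name] = has_parse and bool(paper_metadata.get(name, False))
    return flags


no_parse = pdf_parse_flags({"paper_id": "1", "has_pdf_parse": False})
with_parse = pdf_parse_flags({
    "paper_id": "2",
    "has_pdf_parse": True,
    "has_pdf_parsed_abstract": True,
    "has_pdf_parsed_body_text": False,
    "has_pdf_parsed_bib_entries": True,
    "has_pdf_parsed_ref_entries": False,
})
print(no_parse)
print(with_parse)
```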


##### metadata_CLASS
```python
{
 "paper_id": (string), 
 "title": (string), 
 "authors": [
     {
         "first": (string), 
         "middle": [], 
         "last": (string), 
         "suffix": (string)
     },
     ...
   ]: **Author_Class**, 
 "abstract": (string), 
 "year": (int), 
 "arxiv_id": null, 
 "acl_id": null, 
 "pmc_id": null, 
 "pubmed_id": null, 
 "doi": null, 
 "venue": null, 
 "journal": (string), 
 "mag_id": (string-number), 
 "mag_field_of_study": [
     "Medicine",
     "Computer Science"
   ]: **FieldOfStudy_Enum**, 
 "outbound_citations": [], 
 "inbound_citations": [], 
 "has_outbound_citations": false, 
 "has_inbound_citations": false, 
 "has_pdf_parse": false, 
 "s2_url": (string)
}
```

Here I represent `Author_Class` as an object of the form
```python
{
    "first": (string), 
    "middle": [], 
    "last": (string), 
    "suffix": (string)
}
```
and `FieldOfStudy_Enum` as an Enum of strings such as `[ "Medicine", "Computer Science", "Physics", "Mathematics", ... ]`
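
As a small usage example of the author structure, the sketch below joins the name parts into a display string. The helper name `format_author` and the sample author are hypothetical; the only assumption taken from the schema is that `middle` is a list while the other parts are strings (possibly empty or `null`).

```python
def format_author(author):
    """Join first, middle (a list), last and suffix, skipping empty parts."""
    parts = [
        author.get("first"),
        *(author.get("middle") or []),
        author.get("last"),
        author.get("suffix"),
    ]
    return " ".join(part for part in parts if part)


example_author = {"first": "Ada", "middle": ["M."], "last": "Lovelace", "suffix": ""}
print(format_author(example_author))
```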

In [24]:
def partial_prepare_data(dataset_f: str,
                tokenizer: PreTrainedTokenizer,
                max_seq_length: int = None,
                batch_size: int = 64,
                num_workers: int = 4,
                seed: int = SEED,
                data_field: List[str] =  ["title", "abstract"]) -> DatasetDict:
    """Given an input file, prepare the train, test, validation splits.
    :param dataset_f: input file (format: JSON lines; one paper per line)
    :param tokenizer: pretrained tokenizer, used here only to get the default maximal sequence length
    :param max_seq_length: maximal sequence length (defaults to the tokenizer's one)
    :param batch_size: batch size for the dataloaders (currently unused)
    :param num_workers: number of CPU workers to use during dataloading (currently unused). On Windows this must be zero
    :param seed: random seed (defaults to SEED)
    :param data_field: columns checked for None values and kept in the dataset
    :return: a DatasetDict containing the train, test, valid splits
    """
    print_all_debug = False
    time_debug = True
    print_some_debug = True

    ## ------------------ ##
    ## -- LOAD DATASET -- ##
    ## ------------------ ##
    if time_debug: start = time.time()
    if time_debug: start_load = time.time()
        
    ## execution
    max_seq_length = tokenizer.model_max_length if not max_seq_length else max_seq_length
    if print_some_debug: print(max_seq_length)
    dataset_dict = load_dataset("json", data_files=dataset_f)

    if time_debug: end_load = time.time()
    if time_debug: print(f"[TIME] load_dataset: {end_load - start_load}")
    
    ## ------------------ ##
    ## ---- MANAGING ---- ##
    ## ------------------ ##
    if time_debug: start_selection = time.time()
    
    ## execution
    dataset = dataset_dict['train']
    
    if time_debug: end_selection = time.time()
    if time_debug: print(f"[TIME] dataset_train selection: {end_selection - start_selection}")
    if print_all_debug: print(dataset)
   
    ## ------------------ ##
    ## --- REMOVE none -- ##
    ## ------------------ ##
    if time_debug: start_removing = time.time()
    # clean input removing papers with **None** as abstract/title
    if remove_None_papers:

        ## --------------------- ##
        ## --- REMOVE.indexes -- ##
        ## --------------------- ##
        if time_debug: start_removing_indexes = time.time()
        if print_all_debug: print(data_field)
        
        ## execution
        none_papers_indexes = {}
        for field in data_field:
            none_indexes = [ idx_s for idx_s, s in enumerate(dataset[f"{field}"]) if s is None]
            none_papers_indexes = {**none_papers_indexes, **dict.fromkeys(none_indexes , False)}

        if time_debug: end_removing_indexes = time.time()
        if time_debug: print(f"[TIME] remove.indexes: {end_removing_indexes - start_removing_indexes}")
        if print_all_debug: print(none_papers_indexes)
        
        ## --------------------- ##
        ## --- REMOVE.concat --- ##
        ## --------------------- ##
        if time_debug: start_removing_concat = time.time()
        
        ## execution
        to_remove_indexes = list(none_papers_indexes.keys())

        if time_debug: end_removing_concat = time.time()
        if time_debug: print(f"[TIME] remove.concat: {end_removing_concat - start_removing_concat}")
        if print_all_debug: print(to_remove_indexes)
        if print_all_debug: print([ dataset["abstract"][i] for i in to_remove_indexes])

        ## --------------------- ##
        ## --- REMOVE.filter --- ##
        ## --------------------- ##
        if time_debug: start_removing_filter = time.time()
        
        ## execution
        dataset = dataset.filter((lambda x, ids: none_papers_indexes.get(ids, True)), with_indices=True)
        
        if time_debug: end_removing_filter = time.time()
        if time_debug: print(f"[TIME] remove.filter: {end_removing_filter - start_removing_filter}")
        if print_all_debug: print(dataset)

        
    if time_debug: end_removing = time.time()
    if time_debug: print(f"[TIME] remove None fields: {end_removing - start_removing}")

    ## --------------------- ##
    ## --- REMOVE.column --- ##
    ## --------------------- ##
    if time_debug: start_remove_unused_columns = time.time()
    if remove_Unused_columns:
        
        for column in dataset.column_names:
            if column not in data_field:
                if debug: print(f"{column}")
                dataset.remove_columns_(column)

    if time_debug: end_remove_unused_columns = time.time()
    if time_debug: print(f"[TIME] remove.column: {end_remove_unused_columns - start_remove_unused_columns}")
        
    ## ------------------ ##
    ## --- SPLIT 1.    -- ##
    ## ------------------ ##
    if time_debug: start_first_split = time.time()
    
    # 80% (train), 20% (test + validation)
    ## execution
    train_testvalid = dataset.train_test_split(test_size=0.2, seed=seed)
    
    if time_debug: end_first_split = time.time()
    if time_debug: print(f"[TIME] first [train-(test-val)] split: {end_first_split - start_first_split}")

    ## ------------------ ##
    ## --- SPLIT 2.    -- ##
    ## ------------------ ##
    if time_debug: start_second_split = time.time()
    
    # 10% of total (test), 10% of total (validation)
    ## execution
    test_valid = train_testvalid["test"].train_test_split(test_size=0.5, seed=seed)

    if time_debug: end_second_split = time.time()
    if time_debug: print(f"[TIME] second [test-val] split: {end_second_split - start_second_split}")

    ## execution
    dataset = DatasetDict({"train": train_testvalid["train"],
                          "test": test_valid["test"],
                          "valid": test_valid["train"]})
    if time_debug: end = time.time()
    if time_debug: print(f"[TIME] TOTAL: {end - start}") 
    return dataset
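
The `None` removal in the function above first collects the indexes of the rows to drop and then filters by index. On plain Python rows the same result can be obtained with a per-row predicate, as the sketch below shows. The helper names and the sample rows are hypothetical; with 🤗 *Datasets* a predicate of this kind could be passed to `Dataset.filter`, which might avoid materializing whole columns.

```python
def none_row_indexes(rows, fields):
    """Indexes of the rows where at least one of `fields` is None."""
    return {idx for idx, row in enumerate(rows) for field in fields if row.get(field) is None}


def drop_rows_with_none(rows, fields):
    """Keep only the rows where every field in `fields` is not None."""
    return [row for row in rows if all(row.get(field) is not None for field in fields)]


rows = [
    {"title": "T0", "abstract": "A0"},
    {"title": None, "abstract": "A1"},
    {"title": "T2", "abstract": None},
    {"title": "T3", "abstract": "A3"},
]
fields = ["title", "abstract"]

# index-based variant, as in partial_prepare_data
to_remove = none_row_indexes(rows, fields)
kept_by_index = [row for idx, row in enumerate(rows) if idx not in to_remove]

# predicate-based variant
kept_by_predicate = drop_rows_with_none(rows, fields)

print(sorted(to_remove))
print(kept_by_predicate)
```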

In [25]:
%time

dictionary_input = { "data": ["abstract"], "target": ["title"], "classes": ["mag_field_of_study"]}
dictionary_columns = sum(dictionary_input.values(), [])

# here we use meta_s2orc for speed, 
dataset = partial_prepare_data(meta_s2orc, tokenizer, data_field=dictionary_columns, max_seq_length=max_seq_length)

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.05 µs
512


Using custom data configuration default-d0f4b114df4979ef
Reusing dataset json (/home/vivoli/.cache/huggingface/datasets/json/default-d0f4b114df4979ef/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02)


[TIME] load_dataset: 1.4133775234222412
[TIME] dataset_train selection: 7.152557373046875e-07
[TIME] remove.indexes: 5.205094814300537
[TIME] remove.concat: 0.0026721954345703125


Loading cached processed dataset at /home/vivoli/.cache/huggingface/datasets/json/default-d0f4b114df4979ef/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02/cache-91c1b95631f59966.arrow
remove_columns_ is deprecated and will be removed in the next major version of datasets. Use the dataset.remove_columns method instead.
Loading cached split indices for dataset at /home/vivoli/.cache/huggingface/datasets/json/default-d0f4b114df4979ef/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02/cache-a5d01c93d0c6f581.arrow and /home/vivoli/.cache/huggingface/datasets/json/default-d0f4b114df4979ef/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02/cache-fb2c374f9e35d33e.arrow
Loading cached split indices for dataset at /home/vivoli/.cache/huggingface/datasets/json/default-d0f4b114df4979ef/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02/cache-dd364e56fbe42c75.arrow and /home/vivoli/.cache/huggingface/datasets/json/de

[TIME] remove.filter: 1.443756341934204
[TIME] remove None fields: 6.651668548583984
[TIME] remove.column: 0.010538339614868164
[TIME] first [train-(test-val)] split: 0.02309584617614746
[TIME] second [test-val] split: 0.0066204071044921875
[TIME] TOTAL: 8.105465173721313


In [26]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['title', 'abstract', 'mag_field_of_study'],
        num_rows: 609048
    })
    test: Dataset({
        features: ['title', 'abstract', 'mag_field_of_study'],
        num_rows: 76132
    })
    valid: Dataset({
        features: ['title', 'abstract', 'mag_field_of_study'],
        num_rows: 76131
    })
})
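
The row counts above can be checked by hand. The sketch below assumes that each split rounds the test part up (`ceil`) and gives the remainder to the train part; this is an assumption about the rounding, but it is consistent with the printed sizes. The function name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the test part up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


total_rows = 609048 + 76132 + 76131  # rows left after removing the None papers

# first split: 80% train, 20% test + validation
n_train, n_testvalid = split_sizes(total_rows, 0.2)
# second split: the "train" half becomes validation, the "test" half stays test
n_valid, n_test = split_sizes(n_testvalid, 0.5)

print(n_train, n_test, n_valid)
```

The odd size of the 20% part explains why `test` ends up with one row more than `valid`.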


In [31]:
def preprocess(*sentences_by_column, data, target, classes):
    """Preprocess the raw input sentences from the text file.
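    :param sentences_by_column: one list of strings per input column (data column first, then target)
    :param data: names of the data columns (here ['abstract'])
    :param target: names of the target columns (here ['title'])
    :param classes: names of the class columns (not tokenized here)
    :return: a dictionary with "data_input_ids" and "target_input_ids"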
    """
    print_all_debug = False
    time_debug = False
    print_some_debug = False

    if debug: print(f"[INFO-START] Preprocess on data: {data}, target: {target}") 
    
    assert data == ['abstract'], "data should be ['abstract']"
    if debug: print(data)
    assert target == ['title'], "target should be ['title']"
    if debug: print(target)
        
    data_columns_len = len(data)
    target_columns_len = len(target)
    columns_len = data_columns_len + target_columns_len
    
    assert data_columns_len == 1, "data length should be 1"
    if debug: print(data_columns_len)
    assert target_columns_len == 1, "target length should be 1"
    if debug: print(target_columns_len)
        
    sentences_by_column = np.asarray(sentences_by_column)
    input_columns_len = len(sentences_by_column)
    
    if debug: print(f'all sentences (len {input_columns_len}): {sentences_by_column}')
    
    if target_columns_len == 0:
        raise NameError("No target variable selected, \
                    are you sure you don't want any target?")
        
    data_sentences = sentences_by_column[0]
    target_sentences = sentences_by_column[1] # if columns_len == input_columns_len else sentences_by_column[data_columns_len:-1]
    
    if debug: print(data_sentences)
    if debug: print(target_sentences)

    """
    # clean input removing **None**, converting them to **''**
    if clean_None_data:
        data_sentences = np.asarray([ s if s is not None else '' for s in data_sentences])
        target_sentences = np.asarray([ s if s is not None else '' for s in target_sentences])

    # clean input removing papers with **None** as abstract/title
    elif remove_None_data:
        none_data_indexes = np.asarray([ idx_s for idx_s, s in enumerate(data_sentences) if s is None])
        none_target_indexes = np.asarray([ idx_s for idx_s, s in enumerate(target_sentences) if s is None])

        if debug: print(none_data_indexes)
        if debug: print(none_target_indexes)

        to_removed_indexes = np.union1d(none_data_indexes, none_target_indexes)

        if debug: print(to_removed_indexes)

        data_sentences = np.delete(data_sentences, to_removed_indexes)
        target_sentences = np.delete(target_sentences, to_removed_indexes)
    
    if debug: print(data_sentences)
    if debug: print(target_sentences)
    """
    
    # sentences = [s for s in sentences if s is not None]
    # tokens = [s.strip().split() for s in sentences]
    # tokens = [t[:max_seq_length - 1] + [tokenizer.eos_token] for t in tokens]

    # NOTE: with padding=True (below) each batch is already padded here to its longest sequence.
    # ----------------------------------------------- #
    # -------- TODO include the `collate_fn` -------- #
    # ----------------------------------------------- #
    # Leaving the padding to the dataloader (in a collate_fn) would mean:
    # a bit slower processing, but a smaller saved dataset size
    if print_some_debug: print(max_seq_length)
        
    data_encoded_d = tokenizer(
                        text=data_sentences.tolist(),
                        # add_special_tokens=False,
                        # is_pretokenized=True,
                        padding=True, truncation=True, max_length=max_seq_length,
                        return_token_type_ids=False,
                        return_attention_mask=False,
                        # We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it
                        # receives the `special_tokens_mask`.
                        return_special_tokens_mask=True,
                        return_tensors='np'
    )
    
    target_encoded_d = tokenizer(
                        text=target_sentences.tolist(),
                        # add_special_tokens=False,
                        # is_pretokenized=True,
                        padding=True, truncation=True, max_length=max_seq_length,
                        return_token_type_ids=False,
                        return_attention_mask=False,
                        # We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it
                        # receives the `special_tokens_mask`.
                        return_special_tokens_mask=True,
                        return_tensors='np'
    )

                            

    if debug: print(data_encoded_d["input_ids"].shape)
    if debug: print(target_encoded_d["input_ids"].shape)
    # return encoded_d
    
    return {"data_input_ids": data_encoded_d["input_ids"], "target_input_ids": target_encoded_d["input_ids"]}
    # return {"input_ids": sum(encoded_d['input_ids'], [])} 
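
For the `collate_fn` mentioned in the TODO, the core step is padding each batch to its longest sequence. A dependency-free sketch of that step is given below; the function name `pad_batch` is hypothetical, and the ids 102, 103 and 0 are used because they match the `[CLS]`, `[SEP]` and `[PAD]` ids printed in the vocabulary cell of this notebook. A real `collate_fn` would typically also convert the lists to tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# two already tokenized sentences of different length
batch = pad_batch([[102, 7, 8, 103], [102, 9, 103]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than once for the whole dataset keeps short batches short, at the cost of doing the work at loading time.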

Print an example:
```python 
print(dataset['train'][:10]['title'], dataset['train'][:10]['abstract'])
```

In [32]:
vocab = tokenizer.get_vocab()
print(f"[PAD]: {vocab['[PAD]']}")
print(f"[UNK]: {vocab['[UNK]']}")
print(f"[SEP]: {vocab['[SEP]']}")
print(f"[CLS]: {vocab['[CLS]']}")
print(f"0: {tokenizer.convert_ids_to_tokens(0)}")
print(f"1: {tokenizer.convert_ids_to_tokens(1)}")
print(f"2: {tokenizer.convert_ids_to_tokens(2)}")
print(f"99: {tokenizer.convert_ids_to_tokens(99)}")
print(f"100: {tokenizer.convert_ids_to_tokens(100)}")
print(f"101: {tokenizer.convert_ids_to_tokens(101)}")

[PAD]: 0
[UNK]: 101
[SEP]: 103
[CLS]: 102
0: [PAD]
1: [unused0]
2: [unused1]
99: [unused98]
100: [unused99]
101: [UNK]


In [33]:
tokenizer

PreTrainedTokenizerFast(name_or_path='allenai/scibert_scivocab_uncased', vocab_size=31090, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Finally, I found [this](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=datasetdict#datasets.DatasetDict.map) documentation for the `DatasetDict.map` function from the `datasets` library.
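
As a rough mental model of what `map` does with `batched=True`, `input_columns` and `fn_kwargs`, here is a plain-Python sketch. `toy_batched_map` and `add_lengths` are hypothetical names made up for illustration, and this is not the library's implementation (which also handles caching, Arrow tables and more). It only mirrors the calling convention: the function receives each selected column as a positional list holding one batch, the `fn_kwargs` entries as keyword arguments, and returns a dictionary of new columns.

```python
from typing import Callable, Dict, List


def toy_batched_map(table: Dict[str, list], function: Callable, input_columns: List[str],
                    fn_kwargs: dict, batch_size: int = 2) -> Dict[str, list]:
    """Hypothetical stand-in for a batched `map`; NOT the library implementation."""
    num_rows = len(next(iter(table.values())))
    result = {name: list(values) for name, values in table.items()}
    for start in range(0, num_rows, batch_size):
        # One positional argument per input column, each holding one batch of rows
        batch = [table[name][start:start + batch_size] for name in input_columns]
        new_columns = function(*batch, **fn_kwargs)
        # The returned dictionary is appended, batch by batch, as new columns
        for name, values in new_columns.items():
            result.setdefault(name, []).extend(values)
    return result


def add_lengths(titles, abstracts, unit="chars"):
    """Toy preprocessing function: returns one new column for the batch."""
    if unit == "words":
        return {"total_len": [len(t.split()) + len(a.split()) for t, a in zip(titles, abstracts)]}
    return {"total_len": [len(t) + len(a) for t, a in zip(titles, abstracts)]}


toy_table = {"title": ["a b", "c", "d e f"], "abstract": ["x", "y z", "w"]}
mapped = toy_batched_map(toy_table, add_lengths, input_columns=["title", "abstract"],
                         fn_kwargs={"unit": "words"})
print(mapped["total_len"])
```

The toy function plays the role of `preprocess`: it never sees a whole row, only the columns named in `input_columns`, one batch at a time.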

In [34]:
dataset_map = dataset.map(preprocess, input_columns= dictionary_columns, fn_kwargs= dictionary_input, batched=True)





In [1]:
dataset_map

NameError: name 'dataset_map' is not defined

##### Adding the `mag_index` as well: the index of each `mag_field_of_study` value in the table

In [36]:
mag_field_dict: Dict = {
    "Medicine":              0,
    "Biology":               1,
    "Chemistry":             2,
    # null:                  3,
    "Engineering":           4,
    "Computer Science":      5,
    "Physics":               6,
    "Materials Science":     7,
    "Mathematics":           8,
    "Psychology":            9,
    "Economics":             10,
    "Political Science":     11,
    "Business":              12,
    "Geology":               13,
    "Sociology":             14,
    "Geography":             15,
    "Environmental Science": 16,
    "Art":                   17,
    "History":               18,
    "Philosophy":            19
}

# A missing field of study is an actual null (None), not the string "null":
#
#      real_mag_field_value = paper_metadata['mag_field_of_study']
#
# so we can return the id 3 whenever the value is not a key of the dictionary:
#
#      mag_field_dict.get(real_mag_field_value, 3)
#
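
The fallback described in these comments can be checked in isolation. This is a small sketch with an abbreviated copy of the table (three entries only) and a hypothetical helper name, `mag_to_index`. It relies only on `dict.get` returning its default for any key that is missing, including `None`.

```python
from typing import Optional

# Abbreviated copy of the table, for illustration only
small_mag_field_dict = {"Medicine": 0, "Biology": 1, "Chemistry": 2}
NULL_MAG_INDEX = 3  # index reserved for papers without a field of study


def mag_to_index(field: Optional[str]) -> int:
    """Return the index of `field`, or 3 when it is None or not in the table."""
    return small_mag_field_dict.get(field, NULL_MAG_INDEX)


print(mag_to_index("Biology"), mag_to_index(None), mag_to_index("Unknown field"))
```

Note that any misspelled or unlisted field is also mapped to 3, so the index alone cannot tell a null value from an unknown one.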

In [91]:
def mag_preprocessing(*mags):
    """Map the `mag_field_of_study` values of a batch to their integer index.
    :param mags: one list of field-of-study values per input column
    :return: a dictionary of "mag_index"
    """
    debug = False
    
    if debug: print("[INFO-START] Mag Preprocess")
        
    # dtype=object: papers can list a different number of fields, so the input may be ragged
    mag_field = np.array(mags, dtype=object)
    if debug: print(f'pre flatten (shape {mag_field.shape}): {mag_field}')
    if debug: print(f'pre types: {[type(ele) for ele in mag_field]}')
    if debug: print(f'pre types: {type(mag_field)}')
    
    mag_field = mag_field.flatten()
    if debug: print(f'after flatten (shape {mag_field.shape}): {mag_field}')
    if debug: print(f'after types: {[type(ele) for ele in mag_field]}')
    if debug: print(f'after types: {type(mag_field)}')
        
    # `size` is the number of elements; `shape` is a tuple, so comparing it with 0 is never True
    if mag_field.size == 0:
        raise ValueError("No mag variable selected, "
                         "are you sure you don't want any target?")
    
    # Keep strings (and None, mapped to 3 below) as they are; otherwise take the first field listed.
    # isinstance also accepts numpy strings, which `type(ele) == str` rejects.
    mag_field = [ele if ele is None or isinstance(ele, str) else list(ele)[0] for ele in mag_field]
    
    if debug: print(mag_field)
    if debug: print(mag_field_dict)
        
    # Unknown or missing fields fall back to index 3
    mag_index = np.asarray([mag_field_dict.get(real_mag_field_value, 3) for real_mag_field_value in mag_field])
    
    if debug: print(mag_index)
    
    return {"mag_index": mag_index}

In [92]:
dataset_mag_map = dataset_map.map(mag_preprocessing, input_columns= dictionary_input['classes'], batched=True)






In [93]:
dataset_mag_map

DatasetDict({
    train: Dataset({
        features: ['abstract', 'data_input_ids', 'mag_field_of_study', 'mag_index', 'target_input_ids', 'title'],
        num_rows: 609048
    })
    test: Dataset({
        features: ['abstract', 'data_input_ids', 'mag_field_of_study', 'mag_index', 'target_input_ids', 'title'],
        num_rows: 76132
    })
    valid: Dataset({
        features: ['abstract', 'data_input_ids', 'mag_field_of_study', 'mag_index', 'target_input_ids', 'title'],
        num_rows: 76131
    })
})

# Rename it as you want
---

- `dataset_map.rename_column`: method for renaming a column
- `dataset_map.set_format`: method for defining which columns are returned, and in which format

In [None]:
dataset_map = dataset_map.rename_column("data_input_ids", "input_ids")

In [None]:
dataset_map.set_format("torch", columns=["input_ids"])

In [None]:
print(dataset_map['train'][1]['input_ids'].size())

Then, if you want to store it, use `%store`: it will be stored in the conda environment you are working in.

In [None]:
%store dataset_map

---
---
## ❌ FAKE PIPELINE for training BERT-based NETS
---
---

In [1]:
%store -r dataset_map

In [2]:
dataset_map

DatasetDict({
    train: Dataset({
        features: ['abstract', 'input_ids', 'target_input_ids', 'title'],
        num_rows: 612900
    })
    test: Dataset({
        features: ['abstract', 'input_ids', 'target_input_ids', 'title'],
        num_rows: 76613
    })
    valid: Dataset({
        features: ['abstract', 'input_ids', 'target_input_ids', 'title'],
        num_rows: 76613
    })
})

In [None]:
# tokenizer: we already have it
# model: we already have it

# If you print some elements of `dataset_map['train'][element_index]['input_ids']` you'll see that lots of ids
# are 0 (the `[PAD]` id), so here we count the non-zero ids of each example
vect = [ele[ele.nonzero()].size(0) for ele in dataset_map['train'][:]['input_ids']]

In [14]:
max_vect = max(vect)
min_vect = min(vect)
sum_vect = sum(vect)
len_vect = len(vect)

print(f" max: {max_vect} \n min: {min_vect} \n avg: {sum_vect/len_vect}")

 max: 512 
 min: 2 
 avg: 204.6087681514113
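
The same statistics can be reproduced on toy data without tensors. This is a minimal sketch, assuming the padding id is 0 (as printed for `[PAD]` earlier) and using made-up id lists; `count_real_tokens` is a hypothetical helper name.

```python
PAD_ID = 0  # assumption: matches the `[PAD]` id of the tokenizer


def count_real_tokens(input_ids):
    """Number of non-padding ids in one padded sequence."""
    return sum(1 for token_id in input_ids if token_id != PAD_ID)


# Made-up padded batch: every row has length 6, shorter texts are right-padded with 0
toy_batch = [
    [102, 5, 6, 7, 8, 103],
    [102, 103, 0, 0, 0, 0],
    [102, 9, 10, 103, 0, 0],
]
lengths = [count_real_tokens(row) for row in toy_batch]
print(f" max: {max(lengths)} \n min: {min(lengths)} \n avg: {sum(lengths) / len(lengths)}")
```

A minimum of 2, as in the real output, most likely corresponds to a sequence holding only `[CLS]` and `[SEP]`, that is, an empty text.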


You can see [here](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py) where the code has been borrowed from.