<a href="https://colab.research.google.com/github/bystrowska/idiom-paraphrasing/blob/main/idiom_paraphrasing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set up

In [1]:
!pip install datasets transformers

Collecting datasets
  Downloading datasets-2.1.0-py3-none-any.whl (325 kB)
Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.3.0-py3-none-any.whl (136 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.

Mount Google Drive to load the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Dataset pre-processing

In [None]:
from datasets import load_dataset
dataset = load_dataset('csv', data_files="drive/MyDrive/Colab data/data.csv")
dataset

Using custom data configuration default-31c18aea220c8b74
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-31c18aea220c8b74/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)



DatasetDict({
    train: Dataset({
        features: ['Idiom', 'Sense', 'Idiomatic_Sent', 'Literal_Sent', 'Idiomatic_Label', 'Literal_Label'],
        num_rows: 5170
    })
})

In [None]:
print("number of sentences: " + str(len(dataset["train"])))
print("number of unique idioms: " + str(len(dataset["train"].unique("Idiom"))))

number of sentences: 5170
number of unique idioms: 823


## Split the dataset into train, test and validate sets

The `train_test_split` method shuffles the data and splits it into `train` and `test`. In order to get `train`, `test` and `validate` I'll first split it into `test` and "the rest" and then split the rest into `train` and `validate`.

The book we used in the ML module last semester recommends a 50:25:25 split if there's plenty of data and a 60:20:20 split otherwise. Since I don't have a lot of data, I'll go with the second.
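The arithmetic behind the two-step split can be checked in plain Python. This is only a sketch: `split_sizes` is a hypothetical helper, and it rounds to the nearest row rather than reproducing whatever rounding `train_test_split` applies internally.

```python
def split_sizes(n_rows, test_fraction=0.2, validate_fraction=0.2):
    """Return (train, validate, test) sizes for a two-step split."""
    # Step 1: hold out the test set from the whole dataset.
    n_test = round(n_rows * test_fraction)
    n_rest = n_rows - n_test
    # Step 2: validate is a fraction of the WHOLE dataset, so relative
    # to "the rest" it is validate_fraction / (1 - test_fraction).
    second_step_fraction = validate_fraction / (1 - test_fraction)
    n_validate = round(n_rest * second_step_fraction)
    n_train = n_rest - n_validate
    return n_train, n_validate, n_test


# 5170 is the number of rows reported for this dataset.
print(split_sizes(5170))
```

For a 60:20:20 split the second step uses 0.2 / 0.8 = 0.25, which is why the second call to `train_test_split` takes a different `test_size` than the first.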


### Split data into **train** and **test** with the split being 80:20, so test is 20% of the whole

In [None]:
split_dataset = dataset['train'].train_test_split(test_size=0.2)

In [None]:
print(split_dataset)

DatasetDict({
    train: Dataset({
        features: ['Idiom', 'Sense', 'Idiomatic_Sent', 'Literal_Sent', 'Idiomatic_Label', 'Literal_Label'],
        num_rows: 4136
    })
    test: Dataset({
        features: ['Idiom', 'Sense', 'Idiomatic_Sent', 'Literal_Sent', 'Idiomatic_Label', 'Literal_Label'],
        num_rows: 1034
    })
})


### Split `split_dataset['train']` into `train` and `validate`: validate is meant to be 20% of the whole, so 25% of the current train set

In [None]:
# carve the validate set out of the remaining 80%
train_validate = split_dataset['train'].train_test_split(test_size=0.25)
split_dataset['train'] = train_validate['train']
split_dataset['validate'] = train_validate['test']

In [None]:
print(split_dataset)

DatasetDict({
    train: Dataset({
        features: ['Idiom', 'Sense', 'Idiomatic_Sent', 'Literal_Sent', 'Idiomatic_Label', 'Literal_Label'],
        num_rows: 3102
    })
    test: Dataset({
        features: ['Idiom', 'Sense', 'Idiomatic_Sent', 'Literal_Sent', 'Idiomatic_Label', 'Literal_Label'],
        num_rows: 1034
    })
    validate: Dataset({
        features: ['Idiom', 'Sense', 'Idiomatic_Sent', 'Literal_Sent', 'Idiomatic_Label', 'Literal_Label'],
        num_rows: 1034
    })
})
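Because the split is shuffled, re-running the cells above gives different `train`, `test` and `validate` rows each time unless the shuffle is seeded (`train_test_split` accepts a `seed` argument for this). The sketch below illustrates the idea with the standard library only. `three_way_split` and the seed value are hypothetical and not part of the `datasets` API.

```python
import random


def three_way_split(rows, test_fraction=0.2, validate_fraction=0.2, seed=42):
    """Shuffle rows with a fixed seed and cut them into three splits."""
    shuffled = list(rows)
    # A dedicated Random instance: the same seed gives the same order every run.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_validate = round(len(shuffled) * validate_fraction)
    return {
        "test": shuffled[:n_test],
        "validate": shuffled[n_test:n_test + n_validate],
        "train": shuffled[n_test + n_validate:],
    }


# Row indices stand in for the 5170 sentences.
splits = three_way_split(range(5170))
print({name: len(rows) for name, rows in splits.items()})
```

With a fixed seed the saved dataset can be regenerated exactly, which keeps the test set stable across experiments.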


### Save to disk

In [None]:
split_dataset.save_to_disk("drive/MyDrive/Colab data/clean_dataset") #saves dataset in arrow format

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-31c18aea220c8b74/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-5ee86d3e33f885c6.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-31c18aea220c8b74/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-a98b5f8cb04d583b.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-31c18aea220c8b74/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-06cb6f5eeb52312b.arrow


# Tokenizing

In [2]:
from datasets import load_from_disk

dataset = load_from_disk("drive/MyDrive/Colab data/clean_dataset")
dataset

DatasetDict({
    train: Dataset({
        features: ['Idiom', 'Sense', 'Idiomatic_Sent', 'Literal_Sent', 'Idiomatic_Label', 'Literal_Label'],
        num_rows: 3102
    })
    test: Dataset({
        features: ['Idiom', 'Sense', 'Idiomatic_Sent', 'Literal_Sent', 'Idiomatic_Label', 'Literal_Label'],
        num_rows: 1034
    })
    validate: Dataset({
        features: ['Idiom', 'Sense', 'Idiomatic_Sent', 'Literal_Sent', 'Idiomatic_Label', 'Literal_Label'],
        num_rows: 1034
    })
})

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small") # using t5-small for testing to hopefully save time



## Do processing required for T5

1) Add the task prefix to each input sentence

2) T5 only needs `input_ids` from the input and target sequences. The `input_ids` for the target sequence are called `labels` to distinguish them from the ones for the input

3) Truncate sequences longer than the max length (I chose a length > longest sequence for now)

4) Change `pad_token_id` values in `labels` to `-100` so that they are ignored when calculating the loss
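Steps 1 and 4 can be sketched without a tokenizer. The token ids below are made up for illustration, `add_prefix` and `mask_padding` are hypothetical helper names, and the pad id is hard-coded to 0 on the assumption that it matches the T5 tokenizer's `pad_token_id`.

```python
PAD_TOKEN_ID = 0     # assumed to equal tokenizer.pad_token_id for T5
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def add_prefix(sentences, prefix="paraphrase: "):
    """Step 1: prepend the task prefix to every input sentence."""
    return [prefix + sentence for sentence in sentences]


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Step 4: replace padding ids in the labels with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


inputs = add_prefix(["He kicked the bucket."])
# Two already-padded label sequences with made-up token ids.
padded_labels = [[37, 1712, 1, 0, 0], [8, 3, 9, 1782, 1]]
masked_labels = mask_padding(padded_labels)
print(inputs)
print(masked_labels)
```

Only the labels are masked; padding in `input_ids` is handled by the `attention_mask` the tokenizer returns.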


In [38]:
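# get max sentence length in characters (not tokens); sentences typically
# tokenize to far fewer tokens than characters, so this is a generous upper bound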
max_input_length = 0
max_target_length = 0
for split in dataset:
  val = len(max(dataset[split]["Literal_Sent"], key=len))
  max_target_length = val if val > max_target_length else max_target_length
  val = len(max(dataset[split]["Idiomatic_Sent"], key=len))
  max_input_length = val if val > max_input_length else max_input_length

(max_input_length, max_target_length)

(353, 300)

In [47]:
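prefix = "paraphrase: "
# leave room for the prefix and for the end-of-sequence token the tokenizer appends
max_input_length += len(prefix) + 1
max_target_length += 1


def preprocess_function(examples):
    # prepend the task prefix to every idiomatic (input) sentence
    inputs = [prefix + ex for ex in examples["Idiomatic_Sent"]]
    targets = examples["Literal_Sent"]

    model_inputs = tokenizer(inputs,
                             padding="longest",
                             max_length=max_input_length,
                             truncation=True)

    # truncation=True is needed for max_length to take effect on the targets too
    labels = tokenizer(targets,
                       padding="longest",
                       max_length=max_target_length,
                       truncation=True).input_ids

    # replace padding ids with -100 so the loss ignores them
    model_inputs["labels"] = [[val if val != tokenizer.pad_token_id else -100 for val in array] for array in labels]
    return model_inputs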

In [48]:
tokenized_datasets = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3102
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1034
    })
    validate: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1034
    })
})

### TODO:
I added padding here at tokenization time, which is good if I'll be using a TPU, but for CPU dynamic padding is probably better. In that case the `-100` replacement should be done later too, when each batch is padded.
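A minimal sketch of what dynamic padding would look like, assuming unpadded `input_ids` and `labels` lists per example. `pad_batch` is a hypothetical function written for illustration; as far as I know, `DataCollatorForSeq2Seq` from `transformers` does this kind of per-batch padding and pads labels with `-100` by default, so it would likely replace hand-written code like this.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100):
    """Pad one batch to the length of its own longest sequence."""
    max_input = max(len(ex["input_ids"]) for ex in batch)
    max_label = max(len(ex["labels"]) for ex in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        n_pad = max_input - len(ex["input_ids"])
        padded["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        padded["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        # Labels are padded with -100 directly, so no later replacement is needed.
        n_label_pad = max_label - len(ex["labels"])
        padded["labels"].append(ex["labels"] + [label_pad_id] * n_label_pad)
    return padded


# Made-up token ids for two examples of different lengths.
batch = [
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [5, 1], "labels": [9, 8, 7, 1]},
]
padded = pad_batch(batch)
print(padded)
```

Each batch is then only as wide as its longest member, instead of as wide as the longest sequence in the whole split.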

### Save to disk

In [52]:
tokenized_datasets.save_to_disk("drive/MyDrive/Colab data/tokenized_dataset") #saves dataset in arrow format