# Feature Transformation in this Notebook

In this notebook, we convert raw dialogue text into numeric features (token IDs) that a Transformer model can consume. These features are the input for downstream natural language processing tasks such as summarization.

![Pipeline](./img/generative_ai_pipeline_rlhf_plus.png)

# Understand Embeddings

* For more details on the Transformer architecture, see [Attention Is All You Need](https://arxiv.org/abs/1706.03762).

* **input_ids**:
The ID of each token in the pre-trained vocabulary. If an input has fewer tokens than max_seq_length, the remaining positions are padded with 0.

* **attention_mask**:
Specifies which tokens the model should attend to (1) and which it should ignore (0). Padded positions in input_ids have a 0 in the corresponding attention_mask element.
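
To make the two fields concrete, here is a minimal sketch using a toy whitespace tokenizer and a made-up vocabulary. `TOY_VOCAB`, `UNK_ID`, and `encode` are hypothetical stand-ins for illustration only; they are not the `google/flan-t5-base` tokenizer used in this notebook.

```python
# Toy illustration only: TOY_VOCAB, UNK_ID and encode() are hypothetical
# stand-ins, not the real Hugging Face tokenizer.
TOY_VOCAB = {"summarize": 1, "the": 2, "following": 3, "conversation": 4}
PAD_ID = 0  # padding ID, matching the "padded with 0" description above
UNK_ID = 5  # ID for words missing from the toy vocabulary


def encode(text, max_seq_length):
    """Return (input_ids, attention_mask), padded or truncated to max_seq_length."""
    ids = [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_seq_length]                      # truncate long inputs
    n_pad = max_seq_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    input_ids = ids + [PAD_ID] * n_pad
    return input_ids, attention_mask


input_ids, attention_mask = encode("Summarize the conversation", max_seq_length=6)
print(input_ids)       # [1, 2, 4, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 0, 0, 0]
```

The three real tokens keep a mask value of 1, while the three padded positions get both an ID of 0 and a mask value of 0.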

In [2]:
import psutil

notebook_memory = psutil.virtual_memory()
print(notebook_memory)

if notebook_memory.total < 32 * 1000 * 1000 * 1000:
    print('*******************************************')    
    print('YOU ARE NOT USING THE CORRECT INSTANCE TYPE')
    print('PLEASE CHANGE INSTANCE TYPE TO  m5.2xlarge ')
    print('*******************************************')
else:
    correct_instance_type = True

svmem(total=33229979648, available=25437077504, percent=23.5, used=7377715200, free=7513235456, active=8318095360, inactive=15399342080, buffers=2768896, cached=18336260096, shared=1044480, slab=1490767872)
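
The check above compares against 32 decimal gigabytes (32 * 10^9 bytes), not 32 GiB (32 * 2^30 bytes). The following sketch works through that arithmetic for the `total` value reported in the output; the helper names `has_enough_memory` and `bytes_to_units` are hypothetical and exist only for this illustration.

```python
def has_enough_memory(total_bytes, required_gb=32):
    """Mirror the notebook's check: compare against decimal gigabytes (10**9 bytes)."""
    return total_bytes >= required_gb * 1000 * 1000 * 1000


def bytes_to_units(total_bytes):
    """Return the size in decimal GB and in binary GiB for comparison."""
    return total_bytes / 10**9, total_bytes / 2**30


total = 33229979648  # `total` field copied from the svmem output above
gb, gib = bytes_to_units(total)
print(f"{gb:.2f} GB, {gib:.2f} GiB, passes check: {has_enough_memory(total)}")
```

The reported total is above 32 GB but below 32 GiB, so the check passes only because the threshold is expressed in decimal units.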


In [3]:
%store -r setup_dependencies_passed

In [4]:
try:
    setup_dependencies_passed
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN THE PREVIOUS NOTEBOOK ")
    print("You did not install the required libraries.   ")
    print("++++++++++++++++++++++++++++++++++++++++++++++")

In [5]:
from transformers import AutoTokenizer
from datasets import load_dataset, DatasetDict
import os
import time

## Ensure the Base Dataset is Downloaded

In [6]:
if os.path.isdir('./data-summarization/'):
    print('Dataset already downloaded')
else:
    from datasets import concatenate_datasets
    dataset = load_dataset("knkarthick/dialogsum")
    dataset = concatenate_datasets([dataset['train'], dataset['test'], dataset['validation']])
    os.makedirs('./data-summarization', exist_ok=True)
    dataset = dataset.train_test_split(0.5, seed=1234)
    dataset['test'].to_csv('./data-summarization/dialogsum-1.csv', index=False)
    dataset['train'].to_csv('./data-summarization/dialogsum-2.csv', index=False)

Using custom data configuration knkarthick--dialogsum-6d41e9a7b96e340e
Found cached dataset csv (/root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-6d41e9a7b96e340e/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)


  0%|          | 0/3 [00:00<?, ?it/s]

Creating CSV from Arrow format:   0%|          | 0/8 [00:00<?, ?ba/s]

Creating CSV from Arrow format:   0%|          | 0/8 [00:00<?, ?ba/s]

## Load the Tokenizer and HuggingFace Dataset

In [7]:
model_checkpoint = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
dataset = load_dataset('./data-summarization/')
dataset

Using custom data configuration data-summarization-c29c795d447be033


Downloading and preparing dataset csv/data-summarization to /root/.cache/huggingface/datasets/csv/data-summarization-c29c795d447be033/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/data-summarization-c29c795d447be033/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 14460
    })
})

## Explore an Example Prompt

In [8]:
idx = 40
diag = dataset['train'][idx]['dialogue']
baseline_human_summary = dataset['train'][idx]['summary']

prompt = f'Summarize the following conversation.\n\n{diag}\n\nSummary:'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

print(f'Prompt:\n--------------------------\n{prompt}\n--------------------------')
print(f'Baseline human summary : {baseline_human_summary}')

Token indices sequence length is longer than the specified maximum sequence length for this model (601 > 512). Running this sequence through the model will result in indexing errors


Prompt:
--------------------------
Summarize the following conversation.

#Person1#: Ah! No! Damn it!
#Person2#: It's a blackout. Now I can't see Seinfeld.
#Person1#: So what? I just lost one hour's worth of work.
#Person2#: Really? How could you do that? Don't you save every couple minutes?
#Person1#: No, I didn't save this time. Damn it! And I'm sick of writing this paper. Now I have to write it all over again too.
#Person2#: I've had that problem too many times. So I learned to save. When I'm writing something, I save every three sentences or so. I don't want to lose anything.
#Person1#: I hate computers. Sometimes I think they cause more trouble than they're worth.
#Person2#: What are we going to do now?
#Person1#: I don't know. I feel like going out.
#Person2#: I wonder how much of the city is down.
#Person1#: It doesn't matter. I still can go out and buy a beer.
#Person2#: Maybe. But if there's a blackout, probably the pubs are closed. And besides, I know you have a political sci

## Tokenize the Dataset

In [9]:
def tokenize_function(example):
    # Wrap every dialogue in the summarization instruction prompt
    prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    inp = [prompt + i + end_prompt for i in example["dialogue"]]
    # Pad or truncate prompts and summaries to the tokenizer's maximum length
    example['input_ids'] = tokenizer(inp, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return example

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])

  0%|          | 0/15 [00:00<?, ?ba/s]

In [10]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 14460
    })
})
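
The example prompt explored earlier was 601 tokens long, above the model maximum of 512, which is why `tokenize_function` passes `truncation=True`. The sketch below uses whitespace-separated words as stand-in tokens and assumes truncation removes tokens from the right-hand end. Under that assumption, a long dialogue pushes the trailing `Summary:` marker past the cut-off. `truncate_then_build` is a hypothetical alternative, not something this notebook does.

```python
# Toy illustration: whitespace "tokens" stand in for real subword tokens.
PREFIX = "Summarize the following conversation."
SUFFIX = "Summary:"


def truncate_right(tokens, max_length):
    """Keep only the first max_length tokens (right-side truncation)."""
    return tokens[:max_length]


def build_then_truncate(dialogue, max_length):
    """Same order as tokenize_function: build the full prompt, then truncate."""
    tokens = (PREFIX + " " + dialogue + " " + SUFFIX).split()
    return truncate_right(tokens, max_length)


def truncate_then_build(dialogue, max_length):
    """Hypothetical alternative: shorten only the dialogue so the suffix survives.

    Assumes max_length is larger than the prefix and suffix combined.
    """
    prefix, suffix = PREFIX.split(), SUFFIX.split()
    budget = max(max_length - len(prefix) - len(suffix), 0)
    return prefix + dialogue.split()[:budget] + suffix


long_dialogue = " ".join(f"word{i}" for i in range(20))
print(build_then_truncate(long_dialogue, 10)[-1])  # word5 (the suffix was cut off)
print(truncate_then_build(long_dialogue, 10)[-1])  # Summary:
```

Whether the missing suffix matters for a given model is an empirical question; the sketch only shows where the cut falls.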

## Wrap the preprocessing into a repeatable function

In [11]:
def tokenize_function(example):
    prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    inp = [prompt + i + end_prompt for i in example["dialogue"]]
    example['input_ids'] = tokenizer(inp, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return example

def transform_dataset(input_path,
                      output_path,
                      huggingface_model_name,
                      train_dataset_percentage,
                      test_dataset_percentage,
                      validation_dataset_percentage,
                      ):

    # load in the original dataset
    dataset = load_dataset(input_path)
    print(f'Dataset loaded from path: {input_path}\n{dataset}')
    
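    # Load the tokenizer for the requested model.
    # tokenize_function reads the module-level `tokenizer`, so rebind it here.
    print(f'Loading the tokenizer for the model {huggingface_model_name}')
    global tokenizer
    tokenizer = AutoTokenizer.from_pretrained(huggingface_model_name)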
    
    # make train test validation split
    train_testvalid = dataset['train'].train_test_split(1 - train_dataset_percentage, seed=1234)
    test_valid = train_testvalid['test'].train_test_split(test_dataset_percentage / (validation_dataset_percentage + test_dataset_percentage), seed=1234)
    train_test_valid_dataset = DatasetDict(
        {
            'train': train_testvalid['train'],
            'test': test_valid['test'],
            'validation': test_valid['train']
        }
    )
    print(f'Dataset after splitting:\n{train_test_valid_dataset}')
    
    # tokenize the dataset
    print('Tokenizing the dataset...')
    tokenized_datasets = train_test_valid_dataset.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])
    print('Tokenizing complete!')
    
    # create directory for drop
    os.makedirs(f'{output_path}/train/', exist_ok=True)
    os.makedirs(f'{output_path}/test/', exist_ok=True)
    os.makedirs(f'{output_path}/validation/', exist_ok=True)
    file_root = str(int(time.time()*1000))
    
    # save the dataset to disk
    print(f'Writing the dataset to {output_path}')
    tokenized_datasets['train'].to_parquet(f'{output_path}/train/{file_root}.parquet')
    tokenized_datasets['test'].to_parquet(f'{output_path}/test/{file_root}.parquet')
    tokenized_datasets['validation'].to_parquet(f'{output_path}/validation/{file_root}.parquet')
    print('Preprocessing complete!')
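
The two `train_test_split` calls in `transform_dataset` turn three target percentages into two successive `test_size` values. The sketch below works through that arithmetic in plain Python; `two_stage_split_sizes` and `resulting_fractions` are hypothetical helper names, and the sketch ignores the rounding to whole rows that the real split applies.

```python
def two_stage_split_sizes(train_pct, test_pct, validation_pct):
    """Return the `test_size` arguments for the two successive splits."""
    # First split: everything that is not training data is held out.
    first_test_size = 1 - train_pct
    # Second split: the share of the held-out part that becomes the test set.
    second_test_size = test_pct / (test_pct + validation_pct)
    return first_test_size, second_test_size


def resulting_fractions(first_test_size, second_test_size):
    """Overall (train, test, validation) fractions implied by the two splits."""
    train = 1 - first_test_size
    test = first_test_size * second_test_size
    validation = first_test_size * (1 - second_test_size)
    return train, test, validation


first, second = two_stage_split_sizes(0.85, 0.10, 0.05)
train, test, validation = resulting_fractions(first, second)
print(f"first test_size={first:.3f}, second test_size={second:.3f}")
print(f"train={train:.2f}, test={test:.2f}, validation={validation:.2f}")
```

The requested shares are only reproduced when the three percentages sum to 1; otherwise the held-out portion `1 - train_pct` is divided in the ratio of the test and validation percentages.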

In [12]:
def process(args):

    print(f"Listing contents of {args.input_path}")
    dirs_input = os.listdir(args.input_path)
    for file in dirs_input:
        print(file)

    transform_dataset(input_path=args.input_path, #'./data-summarization/',
                      output_path=args.output_path, #'./data-summarization-processed/',
                      huggingface_model_name=args.model_checkpoint, #model_checkpoint,
                      train_dataset_percentage=args.train_dataset_percentage, # e.g. 0.85
                      test_dataset_percentage=args.test_dataset_percentage, # e.g. 0.10
                      validation_dataset_percentage=args.validation_dataset_percentage, # e.g. 0.05
                     )

    print(f"Listing contents of {args.output_path}")
    dirs_output = os.listdir(args.output_path)
    for file in dirs_output:
        print(file)

In [13]:
class Args:
    input_path: str
    output_path: str
    model_checkpoint: str
    train_dataset_percentage: float
    test_dataset_percentage: float
    validation_dataset_percentage: float

args = Args()

args.model_checkpoint = model_checkpoint
args.input_path = './data-summarization'
args.output_path = './data-summarization-processed'
args.train_dataset_percentage = 0.85
args.test_dataset_percentage = 0.1
args.validation_dataset_percentage = 0.05

process(args)

Using custom data configuration data-summarization-c29c795d447be033
Found cached dataset csv (/root/.cache/huggingface/datasets/csv/data-summarization-c29c795d447be033/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)


Listing contents of ./data-summarization
dialogsum-2.csv
dialogsum-1.csv


  0%|          | 0/1 [00:00<?, ?it/s]

Dataset loaded from path: ./data-summarization
DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 14460
    })
})
Loading the tokenizer for the model google/flan-t5-base
Dataset after splitting:
DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12290
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1447
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 723
    })
})
Tokenizing the dataset...


  0%|          | 0/13 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

Tokenizing complete!
Writing the dataset to ./data-summarization-processed


Creating parquet from Arrow format:   0%|          | 0/13 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Preprocessing complete!
Listing contents of ./data-summarization-processed
test
train
validation
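
Because `process` only reads attributes from `args`, the hand-built `Args` object could be swapped for a command-line parser if this code were moved into a standalone script. The following is a minimal sketch of that idea using the standard-library `argparse` module; the flag names and default values are assumptions chosen to mirror the attributes and values used in this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse the same six settings that the Args class carries.

    Flag names and defaults are assumptions that mirror this notebook's values.
    """
    parser = argparse.ArgumentParser(description="Tokenize the dialogue dataset")
    parser.add_argument("--input_path", type=str, default="./data-summarization")
    parser.add_argument("--output_path", type=str, default="./data-summarization-processed")
    parser.add_argument("--model_checkpoint", type=str, default="google/flan-t5-base")
    parser.add_argument("--train_dataset_percentage", type=float, default=0.85)
    parser.add_argument("--test_dataset_percentage", type=float, default=0.10)
    parser.add_argument("--validation_dataset_percentage", type=float, default=0.05)
    return parser.parse_args(argv)


# Passing an explicit list keeps the example runnable inside a notebook;
# a script would call parse_args() and read sys.argv instead.
args = parse_args(["--train_dataset_percentage", "0.9", "--test_dataset_percentage", "0.05"])
print(args.input_path, args.train_dataset_percentage, args.test_dataset_percentage)
```

The returned namespace exposes the same attribute names as `Args`, so it could be passed to `process(args)` unchanged.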


## Ensure the dataset can be loaded correctly

In [14]:
dataset = load_dataset(
    './data-summarization-processed/',
    data_files={'train': 'train/*.parquet', 'test': 'test/*.parquet', 'validation': 'validation/*.parquet'}
)
dataset

Using custom data configuration data-summarization-processed-4348f97c1d1dec65


Downloading and preparing dataset parquet/data-summarization-processed to /root/.cache/huggingface/datasets/parquet/data-summarization-processed-4348f97c1d1dec65/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/data-summarization-processed-4348f97c1d1dec65/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12290
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1447
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 723
    })
})