# Feature Transformation in this Notebook

In this notebook, we convert raw text into feature embeddings.  This will allow us to perform natural language processing tasks.

![Pipeline](./img/generative_ai_pipeline_rlhf_plus.png)

# Understand Embeddings

* For more details on Transformers Architecture, see [Attention Is All You Need](https://arxiv.org/abs/1706.03762).

* **input_ids**: 
The id from the pre-trained vocabulary that represents the token. (Padding of 0 will be used if the # of tokens is less than max_seq_length)

* **attention_mask**: 
Specifies which tokens should pay attention to (0 or 1). Padded input_ids will have 0 in each of these vector elements.

In [28]:
import psutil

notebook_memory = psutil.virtual_memory()
print(notebook_memory)

if notebook_memory.total < 32 * 1000 * 1000 * 1000:
    print('*******************************************')    
    print('YOU ARE NOT USING THE CORRECT INSTANCE TYPE')
    print('PLEASE CHANGE INSTANCE TYPE TO  m5.2xlarge ')
    print('*******************************************')
else:
    correct_instance_type=True

svmem(total=33229979648, available=20086059008, percent=39.6, used=12727783424, free=3382038528, active=14040580096, inactive=14074322944, buffers=2768896, cached=17117388800, shared=888832, slab=1194696704)


In [29]:
%store -r pretrained_model_checkpoint

In [30]:
try:
    pretrained_model_checkpoint
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the PREPARE section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [31]:
print(pretrained_model_checkpoint)

t5-base


In [32]:
%store -r dataset_templates_name

In [33]:
try:
    dataset_templates_name
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the PREPARE section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [34]:
print(dataset_templates_name)

amazon_us_reviews/Wireless_v1_00


In [35]:
%store -r prompt_template_name

In [36]:
try:
    prompt_template_name
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the PREPARE section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [37]:
print(prompt_template_name)

Generate review headline based on review body


In [55]:
import pandas as pd
from sklearn.model_selection import train_test_split
from pathlib import Path
import csv

def _transform_to_dataset(file, 
                          output_data, 
                          train_split_percentage, 
                          validation_split_percentage, 
                          test_split_percentage, 
                          model_checkpoint, 
                          dataset_templates_name, 
                          prompt_template_name):
    print("file {}".format(file))

    # Read the file
    df = pd.read_csv(file, delimiter="\t", quoting=csv.QUOTE_NONE, compression="gzip")

    df.isna().values.any()
    df = df.dropna()
    df = df.reset_index(drop=True)    
        
    # Split data    
    print("Shape of dataframe before splitting {}".format(df.shape))

    print("train split percentage {}".format(train_split_percentage))
    print("validation split percentage {}".format(validation_split_percentage))
    print("test split percentage {}".format(test_split_percentage))

    holdout_percentage = 1.00 - train_split_percentage
    print("validation holdout percentage {}".format(holdout_percentage))
    
    df_train, df_holdout = train_test_split(df, test_size=holdout_percentage)

    test_holdout_percentage = test_split_percentage / holdout_percentage
    
    print("test holdout percentage {}".format(test_holdout_percentage))
    
    df_validation, df_test = train_test_split(
        df_holdout, test_size=test_holdout_percentage)

    df_train = df_train.reset_index(drop=True)
    df_validation = df_validation.reset_index(drop=True)
    df_test = df_test.reset_index(drop=True)

    print("Shape of train dataframe {}".format(df_train.shape))
    print("Shape of validation dataframe {}".format(df_validation.shape))
    print("Shape of test dataframe {}".format(df_test.shape))
    
    # Convert Pandas dataframes into Datasets
    import datasets
    from datasets import Dataset

    # Create Dataset objects (Arrow PyTables) from Pandas dataframes
    dataset_train = Dataset.from_pandas(df_train)
    dataset_validation = Dataset.from_pandas(df_validation)
    dataset_test = Dataset.from_pandas(df_test)

    # Apply prompt  
    from promptsource.templates import DatasetTemplates
    prompt_templates = DatasetTemplates(dataset_templates_name) 
    
    for template in prompt_templates.templates.values():
        print(template.get_name())
    
    prompt = prompt_templates[prompt_template_name]
    print(prompt.answer_choices)    
    print(prompt.__dict__)
        
    dataset_train = dataset_train \
        .filter(lambda row: len(row['review_headline']) > 50) \
        .select(range(900)) \
        .map(lambda row : {'prompt': prompt.apply(row)[0] + '\n' + prompt.apply(row)[1] + '\n\n'})
    dataset_validation = dataset_validation \
        .filter(lambda row: len(row['review_headline']) > 50) \
        .select(range(50)) \
        .map(lambda row : {'prompt': prompt.apply(row)[0] + '\n' + prompt.apply(row)[1] + '\n\n'})
    dataset_test = dataset_test \
        .filter(lambda row: len(row['review_headline']) > 50) \
        .select(range(50)) \
        .map(lambda row : {'prompt': prompt.apply(row)[0] + '\n' + prompt.apply(row)[1] + '\n\n'})
                  
    # Tokenize    
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

    text_column_name = 'prompt'

    def tokenize_function(examples):        
        tokenized = tokenizer(examples[text_column_name])
        return tokenized

    import multiprocessing

    num_cpus = multiprocessing.cpu_count()
    print('num_cpus {}'.format(num_cpus))

    # if using .tsv, the data will have `product_category`, but not `year`:  https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
    # if using .parquet, the data will have also have `year`:  https://s3.amazonaws.com/amazon-reviews-pds/readme.html
    tokenized_dataset_train = dataset_train.map(tokenize_function, batched=True, num_proc=num_cpus, remove_columns=[
        'marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 'product_category',
        'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase',
        'review_headline', 'review_date', 'review_body', text_column_name]) # 'year'

    tokenized_dataset_validation = dataset_validation.map(tokenize_function, batched=True, num_proc=num_cpus, remove_columns=[
        'marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 'product_category',
        'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase',
        'review_headline', 'review_date', 'review_body', text_column_name]) # 'year'

    tokenized_dataset_test = dataset_validation.map(tokenize_function, batched=True, num_proc=num_cpus, remove_columns=[
        'marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 'product_category',
        'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase',
        'review_headline', 'review_date', 'review_body', text_column_name]) # 'year' 

    # Group into blocks and save to S3/disk

    block_size = 128

    def group_texts(examples):    
        # Concatenate all texts.
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
            # customize this part to your needs.
        total_length = (total_length // block_size) * block_size
        # Split by chunks of max_len.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result

    lm_dataset_train = tokenized_dataset_train.map(
        group_texts,
        batched=True,
        batch_size=10,
        num_proc=num_cpus,
    )
    
    lm_dataset_validation = tokenized_dataset_validation.map(
       group_texts,
       batched=True,
       batch_size=10,
       num_proc=num_cpus,
    )
    
    lm_dataset_test = tokenized_dataset_test.map(
       group_texts,
       batched=True,
       batch_size=10,
       num_proc=num_cpus,
    )
    
    filename_without_extension = Path(Path(file).stem).stem

    os.makedirs('{}/train/'.format(output_data), exist_ok=True)
    os.makedirs('{}/validation/'.format(output_data), exist_ok=True)
    os.makedirs('{}/test/'.format(output_data), exist_ok=True)
    
    lm_dataset_train.to_parquet('{}/train/{}.parquet'.format(output_data, filename_without_extension))    
    lm_dataset_validation.to_parquet('{}/validation/{}.parquet'.format(output_data, filename_without_extension))
    lm_dataset_test.to_parquet('{}/test/{}.parquet'.format(output_data, filename_without_extension))

In [56]:
import functools
import multiprocessing
import glob
import os

def process(args):

    input_files = glob.glob("{}/*.tsv.gz".format(args.input_data))
    print(input_files)

    print("Listing contents of {}".format(args.input_data))
    dirs_input = os.listdir(args.input_data)
    for file in dirs_input:
        print(file)

    train_data = "{}/train".format(args.output_data, args.model_checkpoint)
    validation_data = "{}/validation".format(args.output_data, args.model_checkpoint)
    test_data = "{}/test".format(args.output_data, args.model_checkpoint)

    transform_to_dataset = functools.partial(
        _transform_to_dataset,
        output_data=args.output_data,
        train_split_percentage=args.train_split_percentage, 
        validation_split_percentage=args.validation_split_percentage, 
        test_split_percentage=args.test_split_percentage,
        model_checkpoint=args.model_checkpoint,
        dataset_templates_name=args.dataset_templates_name,
        prompt_template_name=args.prompt_template_name
    )

    num_cpus = multiprocessing.cpu_count()
    print("num_cpus {}".format(num_cpus))

    p = multiprocessing.Pool(num_cpus)
    p.map(transform_to_dataset, input_files)

    print("Listing contents of {}".format(args.output_data))
    dirs_output = os.listdir(args.output_data)
    for file in dirs_output:
        print(file)

    print("Listing contents of {}".format(train_data))
    dirs_output = os.listdir(train_data)
    for file in dirs_output:
        print(file)

    print("Listing contents of {}".format(validation_data))
    dirs_output = os.listdir(validation_data)
    for file in dirs_output:
        print(file)

    print("Listing contents of {}".format(test_data))
    dirs_output = os.listdir(test_data)
    for file in dirs_output:
        print(file)


In [57]:
class Args:
    input_data: str
    output_data: str
    train_split_percentage: float
    validation_split_percentage: float
    test_split_percentage: float
    model_checkpoint: str
    dataset_templates_name: str
    prompt_template_name: str

args = Args()    


args.model_checkpoint = pretrained_model_checkpoint
args.dataset_templates_name = dataset_templates_name
args.prompt_template_name = prompt_template_name

args.input_data = './data-tsv'
args.output_data = './data'
args.train_split_percentage = 0.90
args.validation_split_percentage = 0.05
args.test_split_percentage = 0.05

process(args)

['./data-tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz', './data-tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz', './data-tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz']
Listing contents of ./data-tsv
amazon_reviews_us_Gift_Card_v1_00.tsv.gz
.ipynb_checkpoints
amazon_reviews_us_Digital_Software_v1_00.tsv.gz
amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
num_cpus 8
file ./data-tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz
file ./data-tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz
file ./data-tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
Shape of dataframe before splitting (149081, 15)
train split percentage 0.9
validation split percentage 0.05
test split percentage 0.05
validation holdout percentage 0.09999999999999998
test holdout percentage 0.5000000000000001
Shape of train dataframe (134172, 15)
Shape of validation dataframe (7454, 15)
Shape of test dataframe (7455, 15)
Shape of dataframe before splitting (102084, 15)
train split percentage 0.9
validatio

  0%|          | 0/135 [00:00<?, ?ba/s]

Generate review headline based on review body
Generate review based on rating and category
Given the review headline return a categorical rating
Generate review headline based on rating
Given the review body return a categorical rating
None
{'answer_choices': None, 'id': '5feaa0d7-e4e0-46cc-8517-e00bfa7fd00e', 'jinja': 'Give a short sentence describing the following product review:\n{{review_body}} \n|||\n{{review_headline}}', 'metadata': <promptsource.templates.Template.Metadata object at 0x7f0640bd31d0>, 'name': 'Generate review headline based on review body', 'reference': 'Generate review headline based on review body'}


  0%|          | 0/92 [00:00<?, ?ba/s]

Generate review headline based on review body
Generate review based on rating and category
Given the review headline return a categorical rating
Generate review headline based on rating
Given the review body return a categorical rating
None
{'answer_choices': None, 'id': '5feaa0d7-e4e0-46cc-8517-e00bfa7fd00e', 'jinja': 'Give a short sentence describing the following product review:\n{{review_body}} \n|||\n{{review_headline}}', 'metadata': <promptsource.templates.Template.Metadata object at 0x7f0627a94150>, 'name': 'Generate review headline based on review body', 'reference': 'Generate review headline based on review body'}


  0%|          | 0/131 [00:00<?, ?ba/s]

  0%|          | 0/900 [00:00<?, ?ex/s]

  0%|          | 0/900 [00:00<?, ?ex/s]

  0%|          | 0/900 [00:00<?, ?ex/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/50 [00:00<?, ?ex/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/50 [00:00<?, ?ex/s]

  0%|          | 0/8 [00:00<?, ?ba/s]

  0%|          | 0/50 [00:00<?, ?ex/s]

  0%|          | 0/8 [00:00<?, ?ba/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


num_cpus 8
        

  0%|          | 0/50 [00:00<?, ?ex/s]

     

  0%|          | 0/8 [00:00<?, ?ba/s]

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (545 > 512). Running this sequence through the model will result in indexing errors


#4:   0%|          | 0/1 [00:00<?, ?ba/s]

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

#6:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (514 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (548 > 512). Running this sequence through the model will result in indexing errors


#7:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (582 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (571 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (598 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (556 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (520 > 512). Running this sequence through the model will result in indexing errors


  0%|          | 0/50 [00:00<?, ?ex/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


 num_cpus 8
                       

  0%|          | 0/8 [00:00<?, ?ba/s]

   

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

  

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (538 > 512). Running this sequence through the model will result in indexing errors


 

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (554 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (617 > 512). Running this sequence through the model will result in indexing errors


#5:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (550 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (584 > 512). Running this sequence through the model will result in indexing errors


#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/50 [00:00<?, ?ex/s]

                        

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

  

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

 

Token indices sequence length is longer than the specified maximum sequence length for this model (538 > 512). Running this sequence through the model will result in indexing errors


#4:   0%|          | 0/1 [00:00<?, ?ba/s]

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#6:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (554 > 512). Running this sequence through the model will result in indexing errors


#7:   0%|          | 0/1 [00:00<?, ?ba/s]

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

num_cpus 8
                                

#3:   0%|          | 0/12 [00:00<?, ?ba/s]

#2:   0%|          | 0/12 [00:00<?, ?ba/s]

#1:   0%|          | 0/12 [00:00<?, ?ba/s]

#4:   0%|          | 0/12 [00:00<?, ?ba/s]

#0:   0%|          | 0/12 [00:00<?, ?ba/s]

  

#5:   0%|          | 0/12 [00:00<?, ?ba/s]

#7:   0%|          | 0/12 [00:00<?, ?ba/s]

#6:   0%|          | 0/12 [00:00<?, ?ba/s]

       

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

    

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

  

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (588 > 512). Running this sequence through the model will result in indexing errors


#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (552 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (553 > 512). Running this sequence through the model will result in indexing errors


 

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (564 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (600 > 512). Running this sequence through the model will result in indexing errors


#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (589 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (550 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (680 > 512). Running this sequence through the model will result in indexing errors


                         

#4:   0%|          | 0/12 [00:00<?, ?ba/s]

#1:   0%|          | 0/12 [00:00<?, ?ba/s]

   

#6:   0%|          | 0/12 [00:00<?, ?ba/s]

#0:   0%|          | 0/12 [00:00<?, ?ba/s]

 

#5:   0%|          | 0/12 [00:00<?, ?ba/s]

 

#2:   0%|          | 0/12 [00:00<?, ?ba/s]

#7:   0%|          | 0/12 [00:00<?, ?ba/s]

 

#3:   0%|          | 0/12 [00:00<?, ?ba/s]

         

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

   

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

   

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

 

Token indices sequence length is longer than the specified maximum sequence length for this model (634 > 512). Running this sequence through the model will result in indexing errors


#2:   0%|          | 0/1 [00:00<?, ?ba/s]

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (544 > 512). Running this sequence through the model will result in indexing errors


  

#6:   0%|          | 0/1 [00:00<?, ?ba/s]

  

Token indices sequence length is longer than the specified maximum sequence length for this model (628 > 512). Running this sequence through the model will result in indexing errors


 

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

  

Token indices sequence length is longer than the specified maximum sequence length for this model (555 > 512). Running this sequence through the model will result in indexing errors


                  

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

  

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

  

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

          

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

   

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

          

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

   

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

   

Token indices sequence length is longer than the specified maximum sequence length for this model (544 > 512). Running this sequence through the model will result in indexing errors


#1:   0%|          | 0/1 [00:00<?, ?ba/s]

  

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

 

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

 

Token indices sequence length is longer than the specified maximum sequence length for this model (634 > 512). Running this sequence through the model will result in indexing errors


#2:   0%|          | 0/1 [00:00<?, ?ba/s]

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (628 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (555 > 512). Running this sequence through the model will result in indexing errors


Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

     

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

   

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

      

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

  

#1:   0%|          | 0/12 [00:00<?, ?ba/s]

#5:   0%|          | 0/12 [00:00<?, ?ba/s]

#2:   0%|          | 0/12 [00:00<?, ?ba/s]

#0:   0%|          | 0/12 [00:00<?, ?ba/s]

#7:   0%|          | 0/12 [00:00<?, ?ba/s]

#3:   0%|          | 0/12 [00:00<?, ?ba/s]

#6:   0%|          | 0/12 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

#4:   0%|          | 0/12 [00:00<?, ?ba/s]

                

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

                

#2:   0%|          | 0/1 [00:00<?, ?ba/s]

#1:   0%|          | 0/1 [00:00<?, ?ba/s]

#0:   0%|          | 0/1 [00:00<?, ?ba/s]

#6:   0%|          | 0/1 [00:00<?, ?ba/s]

#4:   0%|          | 0/1 [00:00<?, ?ba/s]

#5:   0%|          | 0/1 [00:00<?, ?ba/s]

#3:   0%|          | 0/1 [00:00<?, ?ba/s]

#7:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Listing contents of ./data
test
eval
.ipynb_checkpoints
train
validation
Listing contents of ./data/train
amazon_reviews_us_Gift_Card_v1_00.parquet
amazon_reviews_us_Digital_Video_Games_v1_00.parquet
amazon_reviews_us_Digital_Software_v1_00.parquet
Listing contents of ./data/validation
amazon_reviews_us_Gift_Card_v1_00.parquet
amazon_reviews_us_Digital_Video_Games_v1_00.parquet
amazon_reviews_us_Digital_Software_v1_00.parquet
Listing contents of ./data/test
amazon_reviews_us_Gift_Card_v1_00.parquet
amazon_reviews_us_Digital_Video_Games_v1_00.parquet
amazon_reviews_us_Digital_Software_v1_00.parquet


In [58]:
from datasets import Dataset

reloaded_dataset_train = Dataset.from_parquet('./data/train/*.parquet'.format(args.model_checkpoint))
reloaded_dataset_validation = Dataset.from_parquet('./data/validation/*.parquet'.format(args.model_checkpoint))
reloaded_dataset_test = Dataset.from_parquet('./data/test/*.parquet'.format(args.model_checkpoint))

Using custom data configuration default-18914fab36d418e5


Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-18914fab36d418e5/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Using custom data configuration default-9736e13a2f7a64e8


Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-18914fab36d418e5/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-9736e13a2f7a64e8/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Using custom data configuration default-79fce7a5e2cc99ed


Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-9736e13a2f7a64e8/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-79fce7a5e2cc99ed/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-79fce7a5e2cc99ed/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


In [59]:
reloaded_dataset_train

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 2944
})

In [60]:
reloaded_dataset_validation

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 146
})

In [61]:
reloaded_dataset_test

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 146
})

# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>