Dependencies

In [2]:
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments, DataCollatorWithPadding

In [15]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/experimental-data/electronics_amazon.csv')
df = df.drop(columns=['Unnamed: 0'])
df

Unnamed: 0,product_title,star_rating,verified_purchase,review_headline,review_body
0,JVC Inner Ear Sports Clip Headphone,5,Y,They don't fall out,They don't fall out like other earbuds cause t...
1,APC Replacement RBC4 UPS battery [Electronics],1,Y,Shorted out my UPS,"Before I go on, let me explain that I replace ..."
2,Case Logic-1,4,Y,"Good, solid product",I'm not sure why they aren't making or selling...
3,Amplified HD Digital Outdoor HDTV Antenna 150 ...,5,Y,You need to mount the antenna above the top of...,I can pick up more than 30 channels. I do have...
4,"Wincor Nixdorf Switching AC/DC Adapter, Model:...",5,Y,Works like a champ.,This works just fine and does what you want it...
...,...,...,...,...,...
4995,Adapter HDMI Female to DVI Male Video Adapter,2,Y,Doesn't work well,The item itself came as advertised but the ite...
4996,Bose QuietComfort 15 Acoustic Noise Cancelling...,5,Y,After wanting them for years...worth the wait!,I travel a lot for work. I love listening to m...
4997,Case it 32-Capacity Molded CD/DVD Case Waterpr...,3,Y,Loved it but...,...the zipper pull broke the 2nd time I used i...
4998,DIRECTV RC66RX IR/RF Remote Control,5,Y,Exactly what I needed!,Remote worked great! All you have to do is set...


In [16]:
dataset = Dataset.from_dict(df)
dataset

Dataset({
    features: ['product_title', 'star_rating', 'verified_purchase', 'review_headline', 'review_body'],
    num_rows: 5000
})

In [17]:
dataset = dataset.class_encode_column("star_rating")


Stringifying the column:   0%|          | 0/5000 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [18]:

# Splitting the dataset into training and testing sets
dataset = dataset.train_test_split(test_size=0.1, seed=42, stratify_by_column="star_rating")

train_dataset = dataset['train']
test_dataset = dataset['test']

In [19]:
!pip install sentencepiece



In [20]:
MODEL_NAME = 't5-base'
tokenizer = T5Tokenizer.from_pretrained('t5-base')

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [42]:
def preprocess_data(examples):
    title = examples['product_title']
    star_rating = examples['star_rating']
    review_headline = examples['review_headline']
    review_body = examples['review_body']
    prompt = []
    response = []
    for e in range(len(title)):
      p =  f"review: {title[e]}, {star_rating[e]} Stars!"
      r = f"{review_headline[0]} {review_body[0]}"
      prompt.append(p)
      response.append(r)

    examples['prompt'] = prompt
    examples['response'] = response

    inputs = tokenizer(examples['prompt'], padding='max_length', truncation=True, max_length=128)
    targets = tokenizer(examples['response'], padding='max_length', truncation=True, max_length=128)

    # Set -100 at the padding positions of target tokens
    target_input_ids = []
    for ids in targets['input_ids']:
        target_input_ids.append([id if id != tokenizer.pad_token_id else -100 for id in ids])

    inputs.update({'labels': target_input_ids})
    return inputs

In [43]:
train_dataset = train_dataset.map(preprocess_data, batched=True)
test_dataset = test_dataset.map(preprocess_data, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/4500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [44]:
# Fine-tuning the T5 model
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

TRAINING_OUTPUT = "/content/drive/MyDrive/Colab Notebooks/model/t5_fine_tuned_reviews"
training_args = TrainingArguments(
    output_dir=TRAINING_OUTPUT,
    num_train_epochs=3,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    save_strategy='epoch',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

trainer.train()

Downloading model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Step,Training Loss
500,0.3544
1000,0.0248


TrainOutput(global_step=1125, training_loss=0.1711387210422092, metrics={'train_runtime': 887.1823, 'train_samples_per_second': 15.217, 'train_steps_per_second': 1.268, 'total_flos': 2055232880640000.0, 'train_loss': 0.1711387210422092, 'epoch': 3.0})

In [45]:
# Saving the model
trainer.save_model(TRAINING_OUTPUT)

In [46]:
model_ = T5ForConditionalGeneration.from_pretrained(TRAINING_OUTPUT)


In [50]:
def generate_review(text):
    inputs = tokenizer("review: " + text, return_tensors='pt', max_length=512, padding='max_length', truncation=True)
    outputs = model_.generate(inputs['input_ids'], max_length=128, no_repeat_ngram_size=3, num_beams=6, early_stopping=True)
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary

In [53]:
random_products = train_dataset.shuffle(42).select(range(10))['product_title']

print(random_products[0]+ ' '+generate_review(random_products[0] + ", 3 Stars!"))
print(generate_review(random_products[1] + ", 1 Stars!"))

AmazonBasics High Speed HDMI Cable Great while they lasted, one dead bud in 11 months I really liked these buds, they are comfortable and sound great for watching movies on the iPad. Unfortunately one bud died just short of a year. Not sure i would get another pair at this price.
Great while they lasted, one dead bud in 11 months I really liked these buds, they are comfortable and sound great for watching movies on the iPad. Unfortunately one bud died just short of a year. Not sure i would get another pair at this price.
