# Fine-tune our GPT-3-style model (BLOOM-560m) with the Amazon Customer Reviews Dataset

In [2]:
import psutil

notebook_memory = psutil.virtual_memory()

# psutil reports memory in bytes, so compare against 32 GB expressed in bytes
if notebook_memory.total < 32 * 1000 * 1000 * 1000:
    print('*******************************************')
    print('YOU ARE NOT USING THE CORRECT INSTANCE TYPE')
    print('PLEASE CHANGE INSTANCE TYPE TO  m5.2xlarge ')
    print('*******************************************')
else:
    correct_instance_type=True
    print(notebook_memory)

svmem(total=32890294272, available=30097223680, percent=8.5, used=2344468480, free=15083159552, active=4470759424, inactive=12267114496, buffers=0, cached=15462666240, shared=1138688, slab=749084672)


In [3]:
from datasets import Dataset

lm_dataset_train = Dataset.from_parquet('./data-gpt3/gpt3-train/*.parquet')
#lm_dataset_validation = Dataset.from_parquet('./data-gpt3/gpt3-validation/*.parquet')
#lm_dataset_test = Dataset.from_parquet('./data-gpt3/gpt3-test/*.parquet')

Using custom data configuration default-c267c09e3f5e7af2
Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/default-c267c09e3f5e7af2/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


In [4]:
model_checkpoint = "bigscience/bloom-560m"

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [5]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our texts using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library, which applies a function to every example in the dataset. The function we pass to `map` simply calls the tokenizer on our texts.
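As a minimal sketch of that tokenization step, the helper below wraps the tokenizer call in a function that `Dataset.map` can apply in batches. The column name `text` and the helper names `tokenize_function` and `tokenize_dataset` are assumptions for illustration; the notebook does not show which column holds the review text, and the parquet files loaded above may already be tokenized.

```python
def tokenize_function(examples, tokenizer, text_column="text"):
    """Tokenize one batch of examples.

    `examples` is a dict mapping column names to lists of values, which is
    what Dataset.map passes to the function when batched=True.
    The column name "text" is an assumption, not taken from the notebook.
    """
    return tokenizer(examples[text_column])


def tokenize_dataset(dataset, tokenizer, text_column="text"):
    """Apply tokenize_function to every example of a datasets.Dataset."""
    return dataset.map(
        tokenize_function,
        batched=True,                   # tokenize many texts per call
        remove_columns=[text_column],   # keep only the tokenizer outputs
        fn_kwargs={"tokenizer": tokenizer, "text_column": text_column},
    )
```

With the `tokenizer` created above, `tokenize_dataset(raw_dataset, tokenizer)` would return a dataset with `input_ids` and `attention_mask` columns, where `raw_dataset` stands for a hypothetical untokenized `Dataset`.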

Now that the data has been loaded, we're ready to instantiate our `Trainer`. We will retrieve our pre-trained model:

In [6]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

In [7]:
prompt = "Write a product review for Norton Antivirus"
result_length = 100
inputs = tokenizer(prompt, return_tensors='pt')

print(tokenizer.decode(model.generate(inputs["input_ids"], 
                       max_length=result_length, 
                       do_sample=True, 
                       top_k=50, 
                       top_p=0.9
                      )[0]))

Write a product review for Norton Antivirus. If you are an administrator or if you are the owner, please provide your administrator credentials for the product reviews.</s>


In [8]:
prompt = "Write a product review for Turbo Tax"
result_length = 100
inputs = tokenizer(prompt, return_tensors='pt')

print(tokenizer.decode(model.generate(inputs["input_ids"], 
                       max_length=result_length, 
                       do_sample=True, 
                       top_k=50, 
                       top_p=0.9
                      )[0]))

Write a product review for Turbo Tax. The product review is not published in the original, but a copy is made available to the purchaser, to be published on the website of the manufacturer. The product review is an open-ended article. It will only contain an introduction to the product and a description of the benefits. In the end of the product review, the product is evaluated according to the manufacturer's recommended dosing method.
After the review, the customer will receive a confirmation e-mail that


In [10]:
from transformers import TrainingArguments

model_name = model_checkpoint.split("/")[-1]

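training_args = TrainingArguments(
    f"{model_name}-finetuned-amazon-customer-reviews",  # output directory
#    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    max_steps=100,  # stop after 100 optimization steps; overrides num_train_epochs
#    eval_steps=10, # skip validation for now
    num_train_epochs=1.0,  # ignored because max_steps is set
    no_cuda=True  # train on the CPU
)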

We pass the model, the training arguments, and the training dataset to the `Trainer` class:

In [11]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset_train,
#    eval_dataset=lm_dataset_validation,
)

max_steps is given, it will override any value given in num_train_epochs


In [12]:
trainer.train()

***** Running training *****
  Num examples = 138117
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 100
  Number of trainable parameters = 559214592


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=100, training_loss=3.8183963012695314, metrics={'train_runtime': 3018.2433, 'train_samples_per_second': 0.265, 'train_steps_per_second': 0.033, 'total_flos': 185741397196800.0, 'train_loss': 3.8183963012695314, 'epoch': 0.01})
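The reported metrics follow from the numbers in the training log. The short calculation below, a sketch that only reuses values printed above, shows why `epoch` is 0.01: with `max_steps=100` and a batch size of 8, the run sees 800 of the 138,117 training examples.

```python
# Values copied from the training log and TrainOutput above
max_steps = 100
batch_size = 8
num_examples = 138117
train_runtime = 3018.2433  # seconds

samples_seen = max_steps * batch_size
epoch_fraction = samples_seen / num_examples
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(f"samples seen:       {samples_seen}")
print(f"fraction of epoch:  {epoch_fraction:.4f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
```

The rounded results match `train_samples_per_second` and `train_steps_per_second` in the output, and the model has seen well under 1% of the training data.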

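Once the training is completed, we can evaluate our model and compute its perplexity on the validation set. That step is skipped in this run because the validation dataset and the `eval_dataset` argument are commented out above.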
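For reference, perplexity is the exponential of the mean cross-entropy loss. The sketch below shows the calculation; the helper name `perplexity_from_loss` is hypothetical, and the commented lines assume the `Trainer` was created with an `eval_dataset`, which is not the case in this run.

```python
import math


def perplexity_from_loss(mean_cross_entropy_loss):
    """Perplexity is exp(mean cross-entropy loss), with the loss in nats."""
    return math.exp(mean_cross_entropy_loss)


# Assumption: with an eval_dataset configured, the evaluation loss would come from
#   eval_results = trainer.evaluate()
#   perplexity_from_loss(eval_results["eval_loss"])

# For illustration only: the training loss reported above (not a validation loss)
print(f"Perplexity on the training loss: {perplexity_from_loss(3.8183963012695314):.2f}")
```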

# Save fine-tuned model

In [13]:
model_path = './tmp_models/gpt3_model/'

model.save_pretrained(model_path)

Configuration saved in ./tmp_models/gpt3_model/config.json
Configuration saved in ./tmp_models/gpt3_model/generation_config.json
Model weights saved in ./tmp_models/gpt3_model/pytorch_model.bin


# Generate text

In [14]:
import transformers
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(model_path)

loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--bigscience--bloom-560m/snapshots/e985a63cdc139290c5f700ff1929f0b5942cced2/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--bigscience--bloom-560m/snapshots/e985a63cdc139290c5f700ff1929f0b5942cced2/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--bigscience--bloom-560m/snapshots/e985a63cdc139290c5f700ff1929f0b5942cced2/tokenizer_config.json
loading configuration file ./tmp_models/gpt3_model/config.json
Model config BloomConfig {
  "_name_or_path": "./tmp_models/gpt3_model/",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "BloomForCausalLM"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "bias_dropout_fusion": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_dropout": 0.0,
  "hidden_si

Text generation with this model supports many advanced parameters at inference time, including the following:

**max_length**: Model generates text until the output length (which includes the input context length) reaches max_length. If specified, it must be a positive integer.

**num_return_sequences**: Number of output sequences returned. If specified, it must be a positive integer.

**num_beams**: Number of beams used in beam search. If specified, it must be an integer greater than or equal to num_return_sequences.

**no_repeat_ngram_size**: Model ensures that a sequence of words of no_repeat_ngram_size is not repeated in the output sequence. If specified, it must be a positive integer greater than 1.

**temperature**: Controls the randomness of the output. A higher temperature makes low-probability words more likely to be sampled, while a lower temperature concentrates sampling on high-probability words. As temperature -> 0, sampling approaches greedy decoding. If specified, it must be a positive float.

**early_stopping**: If True, text generation is finished when all beam hypotheses reach the end-of-sentence token. If specified, it must be boolean.

**do_sample**: If True, sample the next word according to its likelihood. If specified, it must be boolean.

**top_k**: In each step of text generation, sample from only the top_k most likely words. If specified, it must be a positive integer.

**top_p**: In each step of text generation, sample from the smallest possible set of words whose cumulative probability reaches top_p. If specified, it must be a float between 0 and 1.

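**seed**: Fix the randomized state for reproducibility. If specified, it must be an integer. When calling `generate` directly, as in this notebook, the seed is set separately, for example with `transformers.set_seed`.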
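To make the sampling parameters above concrete, here is a simplified, standard-library sketch of how `temperature`, `top_k`, `top_p`, and `seed` shape the choice of the next word. It is not the implementation used by the `transformers` library; the function names and the toy logits are assumptions for illustration.

```python
import math
import random


def filter_next_token_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide the logits before the softmax (lower = sharper)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    max_value = max(scaled.values())
    exps = {tok: math.exp(value - max_value) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)

    # top_k: keep only the k most likely tokens, then renormalize
    if top_k > 0:
        ranked = ranked[:top_k]
        norm = sum(p for _, p in ranked)
        ranked = [(tok, p / norm) for tok, p in ranked]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    norm = sum(p for _, p in ranked)
    return {tok: p / norm for tok, p in ranked}


def sample_next_token(logits, seed=None, **filter_kwargs):
    """Sample one token from the filtered distribution; seed makes it repeatable."""
    rng = random.Random(seed)
    probs = filter_next_token_probs(logits, **filter_kwargs)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy logits for four candidate next words (made-up values)
toy_logits = {"great": 2.0, "good": 1.0, "bad": 0.0, "awful": -1.0}
print(filter_next_token_probs(toy_logits, top_k=2))
print(filter_next_token_probs(toy_logits, top_p=0.8))
print(sample_next_token(toy_logits, seed=0, top_k=3, top_p=0.9))
```

With these toy values, `top_k=2` and `top_p=0.8` both keep only "great" and "good", and a very low temperature puts almost all probability on "great", which is why low temperatures behave like greedy decoding.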

In [15]:
prompt = "Write a product review for Norton Antivirus"
result_length = 100
inputs = tokenizer(prompt, return_tensors='pt')

print(tokenizer.decode(model.generate(inputs["input_ids"], 
                       max_length=result_length, 
                       do_sample=True, 
                       top_k=50, 
                       top_p=0.9
                      )[0]))

Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "transformers_version": "4.26.1"
}



Write a product review for Norton Antivirus, as I used it for years and still it was an excellent program. I was expecting this review to be a quick one but in fact it took up to 5 minutes! I have found Norton to be the best program.If you ever run into an Norton issue, there are solutions: 1. Click on the download icon, enter the downloaded Norton key and click &#34;Save As&#34;. The downloaded key should give you a shortcut for the Norton


In [16]:
prompt = "Write a product review for Turbo Tax"
result_length = 100
inputs = tokenizer(prompt, return_tensors='pt')

print(tokenizer.decode(model.generate(inputs["input_ids"], 
                       max_length=result_length, 
                       do_sample=True, 
                       top_k=50, 
                       top_p=0.9
                      )[0]))

Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "transformers_version": "4.26.1"
}



Write a product review for Turbo Tax?  I am the first person to have used TurboTax and feel comfortable using it with Amazon.com.  I can easily write a product review for Amazon.com as I always have.  I would also recommend this program if you are using a Mac or Windows computer.I tried this program for many years and have used it on numerous computers for many years, and in most cases have been able to track the transactions.I was using a Windows 2010 and I had the
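The generation cells above repeat the same tokenize, generate, and decode steps. One way to avoid the repetition is a small helper like the sketch below. The name `generate_text` and its defaults are assumptions that mirror the cells above; `skip_special_tokens` is passed to `tokenizer.decode` so that markers such as `</s>` can be dropped from the printed text.

```python
def generate_text(model, tokenizer, prompt, max_length=100, top_k=50, top_p=0.9,
                  skip_special_tokens=False):
    """Tokenize a prompt, sample a continuation, and decode it to a string."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        inputs["input_ids"],
        max_length=max_length,  # total length, including the prompt tokens
        do_sample=True,         # sample instead of greedy decoding
        top_k=top_k,
        top_p=top_p,
    )
    # generate returns one sequence per requested output; decode the first
    return tokenizer.decode(output_ids[0], skip_special_tokens=skip_special_tokens)
```

With the fine-tuned `model` and `tokenizer` loaded above, `print(generate_text(model, tokenizer, "Write a product review for Turbo Tax"))` would reproduce the last cell. Because sampling is random, each call generally returns different text.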
