# Fine-tune our GPT-3-style model (BLOOM-560m) with the Amazon Customer Reviews Dataset

In [2]:
import psutil

notebook_memory = psutil.virtual_memory()

# psutil reports total memory in bytes, so compare against 32 GB expressed in bytes
if notebook_memory.total < 32 * 1000 * 1000 * 1000:
    correct_instance_type = False
    print('*******************************************')
    print('YOU ARE NOT USING THE CORRECT INSTANCE TYPE')
    print('PLEASE CHANGE INSTANCE TYPE TO  m5.2xlarge ')
    print('*******************************************')
else:
    correct_instance_type = True
    print(notebook_memory)

svmem(total=32890294272, available=29695799296, percent=9.7, used=2722099200, free=24881819648, active=3729616896, inactive=3195629568, buffers=0, cached=5286375424, shared=1335296, slab=450760704)
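
The `total` field reported by psutil is in bytes. As a small worked example, the sketch below converts the value printed above into decimal gigabytes and binary gibibytes, which matters when choosing a threshold for this kind of check. The variable names are illustrative only.

```python
total_bytes = 32_890_294_272        # svmem total printed above

total_gb = total_bytes / 1000**3    # decimal gigabytes
total_gib = total_bytes / 1024**3   # binary gibibytes

print(f"{total_gb:.2f} GB, {total_gib:.2f} GiB")

# Compare the reported total against a binary (GiB) and a decimal (GB) 32-unit threshold
passes_binary_threshold = total_bytes >= 32 * 1024**3
passes_decimal_threshold = total_bytes >= 32 * 1000**3
print(passes_binary_threshold, passes_decimal_threshold)
```

The reported total sits between the two thresholds, so a check written with decimal gigabytes accepts this instance while one written with binary gibibytes would not.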


In [3]:
from datasets import Dataset

lm_dataset_train = Dataset.from_parquet('./data-gpt3/gpt3-train/*.parquet')
print(lm_dataset_train.shape)

Using custom data configuration default-6f6096bed7554cd8
Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/default-6f6096bed7554cd8/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


(137816, 3)


In [4]:
model_checkpoint = "bigscience/bloom-560m"

To tokenize all our texts with the same vocabulary that was used when training the model, we download the pretrained tokenizer that belongs to the checkpoint. The `AutoTokenizer` class handles this:

In [5]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our texts. This is straightforward with the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library: we define a function that calls the tokenizer on our texts and apply it to the whole dataset.
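
A minimal sketch of such a function, written in the batched form that `Dataset.map(..., batched=True)` uses, where each call receives a dict mapping column names to lists of values. The column name `review_body`, the helper `make_tokenize_function`, and the whitespace `toy_tokenizer` are assumptions for illustration; in the notebook the callable would be the `tokenizer` loaded above.

```python
def make_tokenize_function(tokenizer, text_column="review_body"):
    """Build a function suitable for Dataset.map(..., batched=True).

    text_column is a hypothetical column name; adjust it to the dataset.
    """
    def tokenize_function(examples):
        # examples maps column names to lists of values for one batch
        return tokenizer(examples[text_column])
    return tokenize_function


def toy_tokenizer(texts):
    """Stand-in for a real tokenizer: whitespace split, ids from a tiny vocabulary."""
    vocab = {}
    input_ids = []
    for text in texts:
        ids = [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]
        input_ids.append(ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
    }


tokenize_function = make_tokenize_function(toy_tokenizer)
batch = {"review_body": ["Great antivirus", "Great tax software"]}
tokenized = tokenize_function(batch)
print(tokenized["input_ids"])
```

With the real objects, the same function would be applied along the lines of `dataset.map(make_tokenize_function(tokenizer), batched=True)`.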

Now that the data has been loaded, we're ready to instantiate our `Trainer`. We will retrieve our pre-trained model:

In [6]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

In [7]:
prompt = "Write a product review for Norton Antivirus"
result_length = 100
inputs = tokenizer(prompt, return_tensors='pt')

print(tokenizer.decode(model.generate(inputs["input_ids"], 
                       max_length=result_length, 
                       do_sample=True, 
                       top_k=50, 
                       top_p=0.9
                      )[0]))

Write a product review for Norton Antivirus, a great antivirus tool. This is a product review for Norton Antivirus, a great antivirus tool. Norton Antivirus is a free antivirus that is safe for all users of Windows 10. Norton Antivirus Free. Download free Norton Antivirus. Norton Antivirus is a free antivirus that is safe for all users of Windows 10. Norton Antivirus Free. Download free Norton Antivirus. Norton Antiv


In [8]:
prompt = "Write a product review for Turbo Tax"
result_length = 100
inputs = tokenizer(prompt, return_tensors='pt')

print(tokenizer.decode(model.generate(inputs["input_ids"], 
                       max_length=result_length, 
                       do_sample=True, 
                       top_k=50, 
                       top_p=0.9
                      )[0]))

Write a product review for Turbo Tax and you will be working with a very competent Tax Examiner who will assess and validate your company’s financial statements and business reports to determine the tax burden and compliance requirements of your company. We’ll also review the current legal position of your company and provide guidance and help on your company’s regulatory requirements.
We take pride in our high standards and are committed to making our customers proud of their success. Our experience and professionalism are evident in the quality of our reviews and


In [9]:
from transformers import TrainingArguments

model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    f"{model_name}-finetuned-amazon-customer-reviews",
#    evaluation_strategy = "epoch",
    learning_rate=2e-5,
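    weight_decay=0.01,
    max_steps=100,         # stop after 100 optimizer steps; takes precedence over num_train_epochs
    num_train_epochs=1.0,  # has no effect while max_steps is set
    no_cuda=True           # train on CPU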
)

We pass the model, the training arguments, and the training dataset to the `Trainer` class:

In [10]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset_train
)

max_steps is given, it will override any value given in num_train_epochs
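
To see what this override means for the run configured above, the sketch below estimates how much of the dataset 100 steps cover. It assumes the `TrainingArguments` defaults of `per_device_train_batch_size=8` and `gradient_accumulation_steps=1` in a single CPU process, since the notebook does not set them; the row count is the one printed earlier.

```python
num_rows = 137_816       # from lm_dataset_train.shape above
max_steps = 100          # from TrainingArguments above
batch_size = 8           # assumption: default per-device train batch size
grad_accum_steps = 1     # assumption: default gradient accumulation

examples_per_step = batch_size * grad_accum_steps
examples_seen = max_steps * examples_per_step
steps_per_epoch = -(-num_rows // examples_per_step)  # ceiling division
fraction_of_epoch = examples_seen / num_rows

print(f"examples seen: {examples_seen}")
print(f"steps in one full epoch: {steps_per_epoch}")
print(f"fraction of one epoch: {fraction_of_epoch:.2%}")
```

Under those assumptions the run sees 800 examples, well under 1% of one epoch, which is why `num_train_epochs=1.0` has no effect here.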


In [None]:
train_results = trainer.train()
train_results

# Save fine-tuned model

In [12]:
model_path = './tmp_models/gpt3_model/'

model.save_pretrained(model_path)

Configuration saved in ./tmp_models/gpt3_model/config.json
Configuration saved in ./tmp_models/gpt3_model/generation_config.json
Model weights saved in ./tmp_models/gpt3_model/pytorch_model.bin


# Generate text

In [13]:
import transformers
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(model_path)

loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--bigscience--bloom-560m/snapshots/e985a63cdc139290c5f700ff1929f0b5942cced2/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--bigscience--bloom-560m/snapshots/e985a63cdc139290c5f700ff1929f0b5942cced2/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--bigscience--bloom-560m/snapshots/e985a63cdc139290c5f700ff1929f0b5942cced2/tokenizer_config.json
loading configuration file ./tmp_models/gpt3_model/config.json
Model config BloomConfig {
  "_name_or_path": "./tmp_models/gpt3_model/",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "BloomForCausalLM"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "bias_dropout_fusion": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_dropout": 0.0,
  "hidden_si

This model also supports many advanced inference parameters, including the following:

**max_length**: Model generates text until the output length (which includes the input context length) reaches max_length. If specified, it must be a positive integer.

**num_return_sequences**: Number of output sequences returned. If specified, it must be a positive integer.

**num_beams**: Number of beams used in beam search. If specified, it must be an integer greater than or equal to num_return_sequences.

**no_repeat_ngram_size**: Model ensures that no sequence of no_repeat_ngram_size consecutive words is repeated in the output sequence. If specified, it must be a positive integer greater than 1.

**temperature**: Controls the randomness of the output. A higher temperature makes low-probability words more likely to be sampled, and a lower temperature concentrates sampling on high-probability words. As temperature -> 0, sampling approaches greedy decoding. If specified, it must be a positive float.

**early_stopping**: If True, text generation finishes when all beam hypotheses reach the end-of-sentence token. If specified, it must be boolean.

**do_sample**: If True, sample the next word according to its likelihood. If specified, it must be boolean.

**top_k**: In each step of text generation, sample from only the top_k most likely words. If specified, it must be a positive integer.

**top_p**: In each step of text generation, sample from the smallest possible set of words whose cumulative probability reaches top_p. If specified, it must be a float between 0 and 1.

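**seed**: Fix the randomized state for reproducibility. If specified, it must be an integer. When calling `generate` directly, as in this notebook, the seed is set beforehand rather than passed as an argument, for example with `transformers.set_seed`.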
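
The sampling parameters above (temperature, top_k, top_p, seed) can be illustrated on a toy four-word vocabulary. This is a plain-Python sketch of the idea, not the implementation used by the transformers library; the function names and the toy logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative (renormalized) probability reaches top_p; renormalize the rest."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept_mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one token id after temperature scaling and top-k/top-p filtering."""
    probs = softmax_with_temperature(logits, temperature)
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]   # toy scores for a 4-word vocabulary
rng = random.Random(0)           # fixed seed so repeated runs give the same sample

cold = softmax_with_temperature(logits, temperature=0.1)  # mass concentrates on word 0
filtered = top_k_top_p_filter(softmax_with_temperature(logits), top_k=3, top_p=0.8)
sampled = sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9, rng=rng)

print(cold)
print(filtered)
print(sampled)
```

With these toy values, `top_k=3` drops the least likely word and `top_p=0.8` then keeps only the two most likely words, whose probabilities are renormalized to sum to 1.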

In [14]:
prompt = "Write a review for Norton Antivirus"
result_length = 200
inputs = tokenizer(prompt, return_tensors='pt')

print(tokenizer.decode(model.generate(inputs["input_ids"], 
                       max_length=result_length, 
                       do_sample=True, 
                       top_k=50, 
                       top_p=0.9
                      )[0]))

Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "transformers_version": "4.26.1"
}



Write a review for Norton Antivirus, as I downloaded it into my desktop and it was an excellent program. I was expecting better protection against viruses but it still is what I used.I was very pleased with the way Norton installed my program.  I would suggest that anyone looking to renew an Norton subscription should do it.  It is easier than purchasing a subscription from a different company.  Norton can work with many existing accounts as well as other accounts.  I would recommend it if the person renewing the subscription is unsure of what to do.  Amazon's site does not provide the most accurate information on renewal.I bought Norton Anti-Virus (download link) for a laptop, and since I have a Mac computer this version is perfect for the Mac.  I'm not a fan of Mac OS, so the Mac version of Norton is perfect.  I will probably be using Windows 8.1 for several years to be sure.  I'll probably get it again.<br /><br />The software does everything you need to


In [15]:
prompt = "Write a review for Turbo Tax"
result_length = 200
inputs = tokenizer(prompt, return_tensors='pt')

print(tokenizer.decode(model.generate(inputs["input_ids"], 
                       max_length=result_length, 
                       do_sample=True, 
                       top_k=50, 
                       top_p=0.9
                      )[0]))

Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "transformers_version": "4.26.1"
}



Write a review for Turbo Tax?  Yes, I can easily find them.<br /><br />The service is reliable and easy to use.  You can easily send any type of tax return, including all the types of checks, as long as you have a computer with a good internet connection.<br /><br />The process for filing returns can take up to 3 hours depending on the amount of data.  I found that to be ok and didn't even have to worry about losing the file and reloading the file.I've been using TurboTax for years and its the best tax software available. I have used it for several years and have no problems with the software.  I have also used it at my other business.  I am satisfied with the service and the functionality I get when I use it.The most reliable option for those of us who are able to buy an IT subscription. I have been using TurboTax for almost 20 years and I use it to process all of my returns.  It has been a reliable


In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>