<a href="https://colab.research.google.com/github/desankha88/desankha88/blob/main/M4_AST_01_Finetune_GPT2_A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Programme in AI and MLOps
## A Program by IISc and TalentSprint
### Assignment 1: Fine-tune GPT2

## Learning Objectives

At the end of the experiment, you will be able to:

* load and pre-process text data from a PDF file
* load and use a pre-trained tokenizer
* finetune a GPT-2 language model from Hugging Face's `transformers` library
* push the finetuned model to HuggingFace model hub
* load the finetuned model from hub for inference

## Dataset Description

The text data is taken from The International Gita Society's eBook "***BHAGAVAD-GITA*** Author: Sage Veda Vyasa", translated into English by Ramananda Prasad; see [here](https://www.gita-society.com/Read-bhagavad-gita.html).

It contains:

* the concept of duty and the moral implications of one's actions

* the importance of performing one's duty without attachment to the results

* various teachings, including the importance of performing one's duty according to one's station in life, the nature of the self, and the ultimate purpose of life

* guidance on how to live a righteous life, manage one's emotions, and make ethical decisions

* insights into achieving spiritual enlightenment and understanding one's true nature beyond the physical body

The text data is inside **`document.pdf`**, which is downloaded once the setup cells below are executed.

### **GPT-2**

OpenAI's GPT-2 exhibited an impressive ability to write coherent and passionate essays, exceeding what other language models of its time could produce. GPT-2 was not a particularly novel architecture: it closely follows the **decoder-only transformer**. It was, however, a very large transformer-based language model trained on a massive dataset.

Here, we are going to fine-tune the GPT-2 model on the text of the International Gita Society's eBook, BHAGAVAD-GITA. After fine-tuning, we can expect the model to respond to prompts related to the subject matter of this book.

To know more about GPT-2, refer [here](http://jalammar.github.io/illustrated-gpt2/).
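To make "decoder-only" a little more concrete: each position may attend only to itself and to earlier positions, which is typically enforced with a lower-triangular (causal) mask. The snippet below is a minimal pure-Python sketch of such a mask; `causal_mask` is a hypothetical helper written for illustration and is not part of the `transformers` API.

```python
def causal_mask(seq_len):
    """Hypothetical helper: 1 where position i may attend to position j, else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


# Each row i shows which positions token i is allowed to look at
mask = causal_mask(4)
for row in mask:
    print(row)
```

Row `i` contains ones only up to column `i`, so a token never sees the tokens that come after it, which is what lets the model be trained to predict the next token.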

### Installing Dependencies

In [1]:
%%capture

# For loading models, tokenizers, and datasets from HuggingFace
!pip -q uninstall pyarrow -y
!pip -q install pyarrow==15.0.2
!pip -q install datasets
!pip -q install accelerate
!pip -q install transformers

# For reading text from PDF files
!pip -q install PyPDF2

### <font color="#990000">Restart Session/Runtime</font>

### Setup Steps:

In [2]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2418160" #@param {type:"string"}

In [3]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "7795047882" #@param {type:"string"}

In [4]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M4_AST_01_Finetune_GPT2_A" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")

    ipython.magic("sx wget https://drive.google.com/uc?id=12jYBY0yqwNEErkqol06BtEQF3C-wSUBS -O document.pdf")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://aimlops-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


### Importing required packages

In [5]:
import os
import re
import PyPDF2
import torch
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

import warnings
warnings.filterwarnings('ignore')

### Load the data

The data is in a PDF file (`document.pdf`).

Create a function to read the PDF file:

In [6]:
# Function to read document pdf files

def read_pdf(pdf_path):
    text = ""

    # Open the PDF file
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)

        # Iterate over each page
        for page_num in range(len(reader.pages)):
            if page_num > 3:                         # extract text starting from page 5
                page = reader.pages[page_num]
                text += page.extract_text()

    return text


In [18]:
# Read files/documents

pdf_path = 'document.pdf'
text_file = read_pdf(pdf_path)# YOUR CODE HERE to read text using file path

In [19]:
print(text_file[:8000])

 
  
 
BHAGAVAD -GITA in ENGLISH  
Author: Sage Veda Vy asa 
Translat ed in English : Ramananda Prasad , Ph.D.  
Language Editor s: Needed  
Contact: rprasad@gita -society.com  
***** 
“Let noble thoughts come to us from everywhere”    
(The Vedas)  
 
INTRODUCTION  
The Bhagavad -Gita is a doctrine of universal truth  and a book 
of moral and spiritual growth . Its message is sublime and non -sec-
tarian . It deals with the most sacred metaphysi cal science. It im-
parts the knowledge of the Self and answers two universal ques-
tions: Who am I, and how can I lead a happy and peaceful life in 
this wor d full of dualities  and dilemmas ?  
It's a timeless book of wisdom  that inspired Thoreau, Emerson, 
Einstein, Oppenheimer, Gandhi and many others. The Bhagavad -
Gita teaches us how  to equip ourselves  for the battle of life. A re-
peated study with faith purifies our psyche and guides us to face 
the challenges of modern livin g leading to inner peace and happi-
ness.  
Gita teaches

### Pre-processing

- Remove any excess newline characters from the text
- Remove any excess spaces
- Remove unnecessary words (Header & Page number)
- Keep 100 words per line inside text
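As a rough sketch of the first three steps on a toy string: the sample text and the `clean_text` helper below are made up for illustration, and the header patterns assume the page headers look like the ones in this PDF.

```python
import re


def clean_text(raw):
    """Hypothetical helper: collapse newlines and spaces, then drop page headers."""
    text = re.sub(r'\n+', '\n', raw).strip()   # excess newlines
    text = re.sub(r' +', ' ', text).strip()    # excess spaces
    # Header / page-number patterns assumed to match this eBook's layout
    text = re.sub(r' \d+ International Gita Society', '', text)
    text = re.sub(r' Bhagavad -Gita \d+', '', text)
    return text


# Made-up sample, not taken from the PDF
sample = "The  Self is\n\n\neternal. 12 International Gita Society\nIt is  unborn. Bhagavad -Gita 13\n"
cleaned = clean_text(sample)
print(cleaned)
```

The order matters here: the header patterns expect single spaces, so they are applied after the whitespace has been normalized.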

In [20]:
# Remove excess newline characters

# YOUR CODE HERE
text_file = re.sub(r'\n+', '\n', text_file).strip()

In [21]:
# Remove excess spaces

# YOUR CODE HERE
text_file = re.sub(r' +', ' ', text_file).strip()

In [22]:
# Remove unnecessary words (Header & Page number)
text_file = re.sub(r' \d+ International Gita Society', '', text_file)
text_file = re.sub(r' Bhagavad -Gita \d+', '', text_file)

In [23]:
print(text_file[:8000])

BHAGAVAD -GITA in ENGLISH 
Author: Sage Veda Vy asa 
Translat ed in English : Ramananda Prasad , Ph.D. 
Language Editor s: Needed 
Contact: rprasad@gita -society.com 
***** 
“Let noble thoughts come to us from everywhere” 
(The Vedas) 
 
INTRODUCTION 
The Bhagavad -Gita is a doctrine of universal truth and a book 
of moral and spiritual growth . Its message is sublime and non -sec-
tarian . It deals with the most sacred metaphysi cal science. It im-
parts the knowledge of the Self and answers two universal ques-
tions: Who am I, and how can I lead a happy and peaceful life in 
this wor d full of dualities and dilemmas ? 
It's a timeless book of wisdom that inspired Thoreau, Emerson, 
Einstein, Oppenheimer, Gandhi and many others. The Bhagavad -
Gita teaches us how to equip ourselves for the battle of life. A re-
peated study with faith purifies our psyche and guides us to face 
the challenges of modern livin g leading to inner peace and happi-
ness. 
Gita teaches the spiritual science 

- Keep 100 words per line inside text
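One possible way to express this step as a reusable function is sketched below. `rewrap` is a hypothetical name, and the sketch assumes that any whitespace (including existing newlines) separates words.

```python
def rewrap(text, words_per_line=100):
    """Hypothetical helper: re-flow text so each line holds at most words_per_line words."""
    words = text.split()  # splits on any whitespace, including newlines
    if not words:
        return ''
    lines = [
        ' '.join(words[i:i + words_per_line])
        for i in range(0, len(words), words_per_line)
    ]
    return '\n'.join(lines) + '\n'


# Tiny demo with 3 words per line instead of 100
demo = rewrap("a b c\nd e f g", words_per_line=3)
print(demo)
```

Slicing the flat word list in fixed steps avoids keeping a running counter, and the last line simply holds whatever words remain.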

In [24]:
len(text_file)

131723

In [28]:
# Initialize an empty list to temporarily store words and a string to store the reformatted text
word_list = []  # This will store words in batches of 100
new_text_file = ''  # This will store the final reformatted text with line breaks

# Iterate through each line of the input text file
for line in text_file.split('\n'):
    # Split the line into individual words
    words = line.split()

    # Iterate through each word in the current line
    for word in words:
        word_list.append(word)  # Add the word to the temporary list

        # Check if the word list contains 100 words
        if len(word_list) == 100:
            # Join the 100 words into a single line, add a newline character, and append to the new text
            new_text_file += ' '.join(word_list) + '\n'
            # Reset the word list for the next batch
            word_list = []

# If there are remaining words in the word list after processing all lines
if word_list:
    # Join the remaining words and add them as the final line in the new text
    new_text_file += ' '.join(word_list) + '\n'

In [25]:
new_text_file = ''

# YOUR CODE HERE... to code the logic to keep 100 words per line from 'text_file' and save it in 'new_text_file'

remaining_words_list = []

for line in text_file.split('\n'):
    wordList = []
    wordlist = line.split()
    new_line = ' '.join(wordlist[:100]) + '\n'
    new_text_file = new_text_file  + new_line
    remaining_words_list += wordlist[100:]

new_text_file = new_text_file  + '\n' + ' '.join(remaining_words_list)
len(new_text_file)

129838

In [29]:
print(new_text_file[:8000])

BHAGAVAD -GITA in ENGLISH Author: Sage Veda Vy asa Translat ed in English : Ramananda Prasad , Ph.D. Language Editor s: Needed Contact: rprasad@gita -society.com ***** “Let noble thoughts come to us from everywhere” (The Vedas) INTRODUCTION The Bhagavad -Gita is a doctrine of universal truth and a book of moral and spiritual growth . Its message is sublime and non -sec- tarian . It deals with the most sacred metaphysi cal science. It im- parts the knowledge of the Self and answers two universal ques- tions: Who am I, and how can I lead a happy and peaceful life
in this wor d full of dualities and dilemmas ? It's a timeless book of wisdom that inspired Thoreau, Emerson, Einstein, Oppenheimer, Gandhi and many others. The Bhagavad - Gita teaches us how to equip ourselves for the battle of life. A re- peated study with faith purifies our psyche and guides us to face the challenges of modern livin g leading to inner peace and happi- ness. Gita teaches the spiritual science of Self -realizat

In [30]:
len(new_text_file.split('\n')[0].split())

100

### Split the text into training and validation sets

In [31]:
# Split the text into training and validation sets

train_fraction = 0.8
split_index = int(train_fraction * len(new_text_file))

train_text = new_text_file[:split_index]

val_text =  new_text_file[split_index:] # YOUR CODE HERE to get remaining text_file content

In [32]:
len(train_text)

103812

In [33]:
# Save the training and validation data as text files

with open("train.txt", "w") as f:
    f.write(train_text)

# YOUR CODE HERE to save 'val_text' data into a text file 'val.txt'
with open("val.txt", "w") as f:
    f.write(val_text)
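The split above slices the text by character count, so the boundary can fall in the middle of a line or even a word. If that matters, one alternative is to split on line boundaries instead. The sketch below assumes one sample per line; `split_on_lines` is a hypothetical helper, not part of the original assignment.

```python
def split_on_lines(text, train_fraction=0.8):
    """Hypothetical helper: split text into train/validation parts on line boundaries."""
    lines = [ln for ln in text.split('\n') if ln.strip()]  # drop empty lines
    cut = int(train_fraction * len(lines))
    train_part = '\n'.join(lines[:cut]) + '\n'
    val_part = '\n'.join(lines[cut:]) + '\n'
    return train_part, val_part


# Made-up demo text with 10 short lines
demo_text = '\n'.join(f"line {i}" for i in range(10)) + '\n'
demo_train, demo_val = split_on_lines(demo_text)
print(demo_train.count('\n'), demo_val.count('\n'))
```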

### Load pre-trained tokenizer - GPT2Tokenizer

The GPT2Tokenizer is based on ***Byte-Pair-Encoding***.

Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model.

In BPE, new tokens are added until the desired vocabulary size is reached by learning ***merges***, which are rules to merge two elements of the existing vocabulary together into a new one.

The figure below shows how the vocabulary updates as the BPE algorithm progresses.

<br>
<center>
<img src="https://cdn.iisc.talentsprint.com/AIandMLOps/Images/Byte-pair-encoding.png" width=450px>
</center>

To know more about Byte-Pair Encoding, refer [here](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt#byte-pair-encoding-tokenization).
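To illustrate how merges are learned, here is a toy sketch of the BPE training loop in plain Python. It works on characters and a made-up word-frequency table, whereas the real GPT-2 tokenizer operates on bytes and a far larger corpus; `learn_bpe_merges` is a hypothetical helper, not a `transformers` function.

```python
from collections import Counter


def learn_bpe_merges(word_freqs, num_merges):
    """Hypothetical helper: learn `num_merges` merge rules from word frequencies."""
    # Start with every word split into single characters
    splits = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        # Apply the new merge rule to every word
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = tuple(merged)
    return merges, splits


# Made-up toy corpus: word -> frequency
toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, splits = learn_bpe_merges(toy_corpus, num_merges=3)
print(merges)
print(splits)
```

Each entry in `merges` corresponds to one line of a merges file, and each merged symbol becomes a new vocabulary entry.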

<br>

Some of the parameters required to create a GPT2Tokenizer include:

- ***vocab_file (str):*** path to the vocabulary json file, which maps tokens to integer ids

- ***merges_file (str):*** path to the ***merges*** file, which contains the merge rules, one per line. Each merge rule lists the two entities to merge, separated by a space.



Here, we will instantiate a GPT-2 tokenizer from a predefined tokenizer using `from_pretrained()` method.

It includes a parameter:

- ***pretrained_model_name_or_path:*** It can be a string of a predefined tokenizer hosted inside a model repo on huggingface.co.

    For example: *gpt2, gpt2-medium, gpt2-large, or gpt2-xl*

    This will download the corresponding vocab, merges, and config files.

In [34]:
# Set up the tokenizer
checkpoint = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint) # YOUR CODE HERE to load GPT2Tokenizer using 'checkpoint'           # also try gpt2, gpt2-large and gpt2-medium, also gpt2-xl

# set pad_token_id to unk_token_id
tokenizer.pad_token = tokenizer.unk_token


In [35]:
# Tokenize sample text using GPT2Tokenizer
sample_ids = tokenizer("Hello world")
sample_ids

{'input_ids': [15496, 995], 'attention_mask': [1, 1]}

In [36]:
# Generate tokens for sample text
sample_tokens = tokenizer.convert_ids_to_tokens(sample_ids['input_ids'])
sample_tokens

['Hello', 'Ġworld']

In [37]:
# Generate original text back
tokenizer.convert_tokens_to_string(sample_tokens)

'Hello world'

### Tokenize text data

In [38]:
from datasets import load_dataset

In [39]:
train_file_path = 'train.txt'
val_file_path = 'val.txt'

dataset = load_dataset("text", data_files={"train": train_file_path,
                                           "validation": val_file_path})


In [40]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 185
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 46
    })
})

In [41]:
dataset['train']['text'][0]

'BHAGAVAD -GITA in ENGLISH Author: Sage Veda Vy asa Translat ed in English : Ramananda Prasad , Ph.D. Language Editor s: Needed Contact: rprasad@gita -society.com ***** “Let noble thoughts come to us from everywhere” (The Vedas) INTRODUCTION The Bhagavad -Gita is a doctrine of universal truth and a book of moral and spiritual growth . Its message is sublime and non -sec- tarian . It deals with the most sacred metaphysi cal science. It im- parts the knowledge of the Self and answers two universal ques- tions: Who am I, and how can I lead a happy and peaceful life'

In [42]:
block_size = 256     # max tokens in an input sample

def tokenize_function(examples):
    return tokenizer(examples["text"], padding='max_length', truncation=True, max_length=block_size, return_tensors='pt')

tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [43]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 185
    })
    validation: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 46
    })
})

In [44]:
len(tokenized_datasets['train']['input_ids'][0])

256

In [45]:
tokenizer.decode(tokenized_datasets['train']['input_ids'][0])

'BHAGAVAD -GITA in ENGLISH Author: Sage Veda Vy asa Translat ed in English : Ramananda Prasad , Ph.D. Language Editor s: Needed Contact: rprasad@gita -society.com ***** “Let noble thoughts come to us from everywhere” (The Vedas) INTRODUCTION The Bhagavad -Gita is a doctrine of universal truth and a book of moral and spiritual growth . Its message is sublime and non -sec- tarian . It deals with the most sacred metaphysi cal science. It im- parts the knowledge of the Self and answers two universal ques- tions: Who am I, and how can I lead a happy and peaceful life<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|e

### Data Collator

Data collators are objects that:

- will form a batch by using a list of dataset elements as input
- may apply some processing (like padding)

One of the data collators, `DataCollatorForLanguageModeling`, can also apply some random data augmentation (like random masking) on the formed batch.

<br>

`DataCollatorForLanguageModeling` is a data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

Parameters:

- ***tokenizer:*** The tokenizer used for encoding the data.
- ***mlm*** (bool, optional, default=True): Whether or not to use masked language modeling.
    - If set to False, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100).
    - Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
- ***return_tensors*** (str): The type of Tensor to return. Allowable values are “np”, “pt” and “tf” for numpy array, pytorch tensor, and tensorflow tensor respectively.

To know more about `DataCollatorForLanguageModeling` parameters, refer [here](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling).
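To see what `mlm=False` means for the labels, the following plain-Python sketch mimics only the labelling rule described above: labels are a copy of the inputs, with padding positions replaced by -100 so that the loss ignores them. `make_causal_lm_labels` is a hypothetical helper, and the token ids are illustrative; the real collator works on tensors and also handles batching.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def make_causal_lm_labels(input_ids, pad_token_id):
    """Hypothetical helper: copy input ids, masking padding positions with -100."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in input_ids
    ]


# Illustrative batch; 50256 plays the role of the padding id here
batch = [
    [15496, 995, 50256, 50256],
    [15496, 995, 13, 50256],
]
labels = make_causal_lm_labels(batch, pad_token_id=50256)
print(labels)
```

Because this notebook reuses an existing special token as the padding token, every occurrence of that token id is ignored in the loss, not only the trailing padding.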

In [46]:
# Create a Data collator object
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")

### Load pre-trained Model

***GPT2LMHeadModel*** is the GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).

This model is a PyTorch `torch.nn.Module` subclass which can be used as a regular PyTorch Module.

Parameters:

- ***config (GPT2Config):*** Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration.

Here, we will instantiate a pretrained pytorch model from a pre-trained model configuration, using `from_pretrained()` method, that will load the weights associated with the model.

In [None]:
# Set up the model
model = GPT2LMHeadModel.from_pretrained(checkpoint)# YOUR CODE HERE to load GPT2LMHeadModel using 'checkpoint'             # also try gpt2, gpt2-large and gpt2-medium, also gpt2-xl

**Note: The training times for different GPT-2 models on a GPU for this dataset are as follows:**

* **GPT-2: ~25 minutes for 100 epochs**

* **GPT-2 Medium: ~1 hour for 100 epochs**

* **GPT-2 Large: Runs out of memory**

### Fine-tune Model *(Switch to GPU runtime if needed)*

Train a GPT-2 model using the provided training arguments. Save the resulting trained model and tokenizer to a specified output directory.

The `Trainer` class provides an API for feature-complete training in PyTorch for most standard use cases.

Before instantiating your Trainer, create a `TrainingArguments` to access all the points of customization during training.

`TrainingArguments` parameters:

- ***output_dir*** (str): The output directory where the model predictions and checkpoints will be written.
- ***overwrite_output_dir*** (bool, optional, default=False): If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.
- ***per_device_train_batch_size*** (int, optional, default=8): The batch size per GPU/TPU/MPS/NPU core/CPU for training.
- ***per_device_eval_batch_size*** (int, optional, default=8): The batch size per GPU/TPU/MPS/NPU core/CPU for evaluation.
- ***save_total_limit*** (int, optional): If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir.

To know more about `TrainingArguments` parameters, refer [here](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/trainer#transformers.TrainingArguments).

To know more about `Trainer` parameters, refer [here](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/trainer#transformers.Trainer).
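To get a feel for how the batch size, epoch count, `save_steps`, and `save_total_limit` interact, the arithmetic can be sketched as below. This assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `checkpoint_plan` is a hypothetical helper, and the numbers are the ones used in this notebook (185 training rows, batch size 4, 100 epochs).

```python
import math


def checkpoint_plan(num_examples, batch_size, num_epochs, save_steps, save_total_limit):
    """Hypothetical helper: estimate total optimizer steps and surviving checkpoints."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    # A checkpoint is written every `save_steps` steps
    saved = list(range(save_steps, total_steps + 1, save_steps))
    # Only the most recent `save_total_limit` checkpoints are kept (assumes limit >= 1)
    kept = saved[-save_total_limit:]
    return total_steps, kept


total_steps, kept = checkpoint_plan(
    num_examples=185, batch_size=4, num_epochs=100,
    save_steps=1_000, save_total_limit=2,
)
print(total_steps, kept)
```

Under these assumptions, the final weights are not necessarily among the step checkpoints, which is one reason the notebook saves the model explicitly after training.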

In [None]:
# Set up the training arguments

model_output_path = "/content/gpt2_model"

training_args = TrainingArguments(           # YOUR CODE HERE to create TrainingArguments object with
    output_dir = model_output_path,          # model_output_path as output directory,
    overwrite_output_dir = True,             # overwrite the content of output directory,
    per_device_train_batch_size = 4,         # use 4 as batch size per device for training and evaluation,
    per_device_eval_batch_size = 4,
    num_train_epochs = 100,
    save_steps = 1_000,
    save_total_limit = 2,
    logging_dir = './logs')

In [None]:
# Train the model
trainer = Trainer(                                    # YOUR CODE HERE to create Trainer object with
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = tokenized_datasets['train'],      # tokenized training split
    eval_dataset = tokenized_datasets['validation'])  # tokenized validation split

In [None]:
# Disabling Weights and Biases logging
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
trainer.train()

In [None]:
# Save the model
saved_model_path = "/content/finetuned_gpt2_model"
trainer.save_model(saved_model_path)

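# Save the tokenizer
# YOUR CODE HERE to save 'tokenizer' using save_pretrained() method at path saved_model_path,
# so that the model and tokenizer can later be loaded from the same directory
tokenizer.save_pretrained(saved_model_path)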

### Test Model with user input prompts

##### Now, let us test the model with some prompts


The `generate_response()` function takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model.

In [None]:
def generate_response(model, tokenizer, prompt, max_length=200):

    input_ids = tokenizer.encode(prompt, return_tensors="pt")      # 'pt' for returning pytorch tensor

    # Check the device of the model
    device = next(model.parameters()).device

    # Move input_ids to the same device as the model
    input_ids = input_ids.to(device)

    # Create the attention mask and pad token id
    attention_mask = torch.ones_like(input_ids)
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        attention_mask=attention_mask,
        pad_token_id=pad_token_id
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)


In [None]:
# Load the fine-tuned model and tokenizer

my_model = GPT2LMHeadModel.from_pretrained(saved_model_path)    # YOUR CODE HERE to load model from saved_model_path
my_tokenizer = GPT2Tokenizer.from_pretrained(saved_model_path)  # YOUR CODE HERE to load tokenizer from saved_model_path

In [None]:
# Testing

prompt = "How can one live a righteous life?"           # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:")
response

In [None]:
# Testing

prompt = "What is the purpose of life?"           # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)   # YOUR CODE HERE
response


In [None]:
# Testing
prompt = "What is Karma?"          # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)   # YOUR CODE HERE
response


In [None]:
# Testing
prompt = "How to overcome dilemma or fear?"          # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)   # YOUR CODE HERE
response


In [None]:
# Testing
prompt = "How to control emotions during tough times?"          # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)   # YOUR CODE HERE
response


In [None]:
# Testing
prompt = "Is there a way to achieve enlightenment?"          # Replace with your desired prompt
response = generate_response(my_model, my_tokenizer, prompt)   # YOUR CODE HERE
response


The GPT-2 tokenizer uses a byte-pair encoding (BPE) algorithm, which splits text into subword units. As a result, one word might be represented by multiple tokens.

For example, if you set `max_length` to 50, the output (prompt tokens included) will be limited to 50 tokens, which usually corresponds to fewer than 50 words, depending on the text.
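Since `max_length` in `generate()` counts the prompt tokens as well as the newly generated ones, the room left for new text can be worked out as in this small sketch. `new_token_budget` is a hypothetical helper; the ids are the ones the tokenizer returned for "Hello world" earlier in this notebook.

```python
def new_token_budget(prompt_ids, max_length):
    """Hypothetical helper: tokens left for generation once the prompt is counted."""
    return max(0, max_length - len(prompt_ids))


prompt_ids = [15496, 995]  # "Hello world" from the tokenizer demo above
budget = new_token_budget(prompt_ids, max_length=50)
print(budget)
```

A long prompt therefore leaves less room for the answer; if the prompt alone reaches `max_length`, there is no room left for new tokens.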

## Push your fine-tuned model to HuggingFace Model Hub

**Steps to push your fine-tuned model to HuggingFace Model Hub**

1. [Sign up](https://huggingface.co/join) for a Hugging Face account
2. Create an access token for your account and save it
3. Store your access token in the Hugging Face cache folder within colab
4. Push your fine-tuned model and tokenizer to Model Hub
5. Load the model back from Hub and test it with user input prompts

* **Create an access token for your account**

    Once you have an account, to create an access token:
    
    - Go to your `Settings`, then click on the `Access Tokens` tab. Click on the `New token` button to create a new User Access Token.
    - Select a Token type as `Write` and give a name for your token
    - Click on Create token
    - Once a token is created save it somewhere
    - When required later, use the old saved token or create a new token again

    To know more about Access Tokens, refer [here](https://huggingface.co/docs/hub/security-tokens).

* **Store your access token in the Hugging Face cache folder within colab**

    Once you have your User Access Token, run the following command to authenticate your identity to the Hub.
    - `!huggingface-cli login --token YOUR_TOKEN_HERE`
    - Replace `YOUR_TOKEN_HERE` with the access token you saved earlier
    - If you want to save the token to the Git credentials helper, you can modify your command to include the `--add-to-git-credential` flag:
     
      `!huggingface-cli login --token YOUR_TOKEN_HERE --add-to-git-credential`


  For more details on login, refer [here](https://huggingface.co/docs/huggingface_hub/quick-start#login).

In [None]:
!huggingface-cli login --token YOUR_TOKEN_HERE

* **Push your fine-tuned model and tokenizer to Model Hub**

    - Use the `push_to_hub()` method of both your model and your tokenizer to push them to the Hub
    - Specify the name of the repository where the model and tokenizer will be pushed using the `repo_id` parameter
    - Push the model and tokenizer to the same repository

        - Use `push_to_hub()` method of your model. For parameter details, refer [here](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.push_to_hub).
        - Use `push_to_hub()` method of your tokenizer. For parameter details, refer [here](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.push_to_hub).
        - Access your pushed model at `https://huggingface.co/[YOUR-USER-NAME]/[YOUR-MODEL-REPO-NAME]/tree/main`

In [None]:
# Push model
my_repo = "gita-text-generation-gpt2"
model.push_to_hub(repo_id= my_repo, commit_message= "Upload fine-tuned model")

In [None]:
# Push tokenizer
tokenizer.push_to_hub(repo_id= my_repo, commit_message= "Upload tokenizer used")

Access your pushed model at `https://huggingface.co/[YOUR-USER-NAME]/[YOUR-MODEL-REPO-NAME]/tree/main`

For example: https://huggingface.co/yrajm1997/gita-text-generation-gpt2/tree/main

* **Load the model and tokenizer back from Hub and test it with user input prompts**

    - In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the `from_pretrained()` method. **AutoClasses** can be used to automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.

    - Instantiating one of `AutoConfig`, `AutoModel`, and `AutoTokenizer` will directly create a class of the relevant architecture.

    - Because the GPT-2 model used here has a language modeling head on top, use the auto class for causal language modeling, `AutoModelForCausalLM` (the older `AutoModelWithLMHead` class is deprecated).

    - Specify full path of your model repo i.e. ***''YOUR-USER-NAME/YOUR-REPO-NAME''*** while calling `from_pretrained()` method.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
username = "yrajm1997"      # change it to your HuggingFace username

my_checkpoint = username + '/' + my_repo       # eg. "yrajm1997/gita-text-generation-gpt2"
my_checkpoint

In [None]:
# Load your model from hub
loaded_model = AutoModelForCausalLM.from_pretrained(my_checkpoint)

In [None]:
# Load your tokenizer from hub
loaded_tokenizer = AutoTokenizer.from_pretrained(my_checkpoint)

In [None]:
# Testing

prompt = "How can one live a righteous life?"           # Replace with your desired prompt
response = generate_response(loaded_model, loaded_tokenizer, prompt)
print("Generated response:")
response

### Please answer the questions below to complete the experiment:




In [None]:
#@title The architecture of GPT is very similar to: { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "" #@param ["", "the encoder-only transformer", "the decoder-only transformer", "the encoder-decoder transformer", "none of the above"]

In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]

In [None]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}

In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["","Yes", "No"]

In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]

In [None]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]

In [None]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")