![GPT-Chat Banner](https://openfileserver.chloelavrat.com/workshops/RadioGPT/assets/gpt-chat-banner.png)



> 💡 **PLEASE CONNECT USING A GPU SESSION FOR MORE COMPUTE POWER** :
>
> `Runtime > Change runtime type > T4 GPU > Save`


# 🎯 Objective


The goal of this second part of the workshop is to interact and experiment with a pre-trained language model. You will get hands-on experience with a model that is still small, but larger than the one used in the previous part (83M parameters). In this section you will discover:

- How to load and work with a pre-trained language model.
- The structure of chatbot datasets and how to leverage them for generating responses.
- The effects of adjusting model parameters like temperature, prompt design, and token limits.
- The challenges of training and using very small models in real-world applications, including biases and overfitting.
- How to explore and critique the model's responses to identify areas for improvement.

By the end of this phase, you will have experimented with generating text, gained insight into model behavior, and understood the trade-offs involved in working with smaller models. 🚀

# 🎬 Notebook initialization

In [2]:
!git clone https://github.com/chloelavrat/RadioGPT.git > /dev/null 2>&1
!cd RadioGPT && git checkout clavrat/first-version > /dev/null 2>&1
!pip install torch datasets tqdm transformers > /dev/null 2>&1

# 📀 Dataset Overview

The French Alpaca Dataset contains **110,368 instruction-based examples** in French in the Alpaca format. The dataset was specifically created to fine-tune general language models for tasks like question-answering and text generation.

**Here’s the twist:** this dataset was entirely generated by GPT-3.5-turbo itself! In other words, an LLM was used to create data that would train... another LLM. It's a fascinating example of AI systems helping to improve their own performance, all while producing a diverse set of French instructions that can be used for fine-tuning and model experimentation.

🔗 [Explore the dataset on Hugging Face](https://huggingface.co/datasets/jpacifico/French-Alpaca-dataset-Instruct-110K)
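
To make the "Alpaca format" concrete, here is a minimal sketch of how one record could be turned into the `Question: ... / Answer: ...` text used later in this notebook. The field names `instruction`, `input` and `output` are the usual Alpaca keys. The sample record and the `format_example` helper are hypothetical, and the way `AlpacaDataset` actually handles the `input` field is not shown here, so treat this as an illustration only.

```python
def format_example(record):
    """Turn one Alpaca-style record into a Question/Answer training string."""
    instruction = record["instruction"].strip()
    # Assumption: an optional "input" field is simply appended to the instruction.
    context = record.get("input", "").strip()
    question = f"{instruction} {context}".strip() if context else instruction
    answer = record["output"].strip()
    return f"Question: {question}\nAnswer: {answer}"


# Hypothetical record written for this example (not copied from the dataset).
sample = {
    "instruction": "Quel est l'élément chimique avec le numéro atomique 29 ?",
    "input": "",
    "output": "Le cuivre.",
}
print(format_example(sample))
```

The same template, without the answer, is what the prompt cells below send to the model.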

In [3]:
#@title Download French Alpaca dataset
!mkdir -p dataset
!wget https://openfileserver.chloelavrat.com/workshops/RadioGPT/dataset/Acquiesce_data_110k_instructions.json -O dataset/Acquiesce_data_110k_instructions.json
print("Dataset downloaded")

--2024-12-14 15:48:10--  https://openfileserver.chloelavrat.com/workshops/RadioGPT/dataset/Acquiesce_data_110k_instructions.json
Resolving openfileserver.chloelavrat.com (openfileserver.chloelavrat.com)... 149.202.72.149
Connecting to openfileserver.chloelavrat.com (openfileserver.chloelavrat.com)|149.202.72.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51958855 (50M) [application/json]
Saving to: ‘dataset/Acquiesce_data_110k_instructions.json’


2024-12-14 15:48:12 (23.9 MB/s) - ‘dataset/Acquiesce_data_110k_instructions.json’ saved [51958855/51958855]

Dataset downloaded


In [4]:
#@title Load Alpaca dataset
from RadioGPT.gptmodel.core.dataset import AlpacaDataset

block_size = 64 # you remember, "sets the length of input sequences the model processes at once"

# Dataset
dataset = AlpacaDataset("dataset/Acquiesce_data_110k_instructions.json", block_size)

print("Vocabulary size   : ", dataset.vocab_size)
print("Batch token       : \n", dataset.get_batch(1))
print("Batch decoded     : \n", dataset.decode(dataset.get_batch(1)[0][0].tolist()))

Loaded 110368 conversations
Maximum sequence length: 64
Vocabulary size   :  50257
Batch token       : 
 (tensor([[24361,    25,   360,  2634,    66,  2442,   551, 39073, 13199,  2912,
          1500,   622,   557, 17809, 17266,  1653,   551,  1489,   271, 28141,
           636,   343,   390,  1976,  2634,   305,    13,   198, 33706,    25,
          4689,  5103,   288,     6,  1726, 17266,  1653,   551,  1489,   271,
         28141,   636,   343,   390,  1976,  2634,   305,  4114,  2350,  5556,
           494,  1834,   220, 25125,  7916,    13,   309,   448,   288,     6,
           397,   585,    11]]), tensor([[   25,   360,  2634,    66,  2442,   551, 39073, 13199,  2912,  1500,
           622,   557, 17809, 17266,  1653,   551,  1489,   271, 28141,   636,
           343,   390,  1976,  2634,   305,    13,   198, 33706,    25,  4689,
          5103,   288,     6,  1726, 17266,  1653,   551,  1489,   271, 28141,
           636,   343,   390,  1976,  2634,   305,  4114,  2350,  5556,
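
In the printed batch, the second tensor is the first one shifted by one token: the model learns to predict token `i + 1` from tokens `0..i`. The sketch below reproduces that relationship on a plain list. The helper name `make_next_token_pair` is hypothetical, and the real `AlpacaDataset.get_batch` may build its batches differently.

```python
def make_next_token_pair(token_ids, block_size):
    """Split a chunk of block_size tokens into (inputs, targets) shifted by one."""
    chunk = token_ids[:block_size]
    inputs = chunk[:-1]   # everything except the last token
    targets = chunk[1:]   # the same sequence, one position later
    return inputs, targets


# The first five token ids from the batch printed above.
tokens = [24361, 25, 360, 2634, 66]
x, y = make_next_token_pair(tokens, block_size=4)
print(x)  # [24361, 25, 360]
print(y)  # [25, 360, 2634]
```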

# 🧠 Model

In [5]:
#@title Loading ...
# Load RadioGPT's checkpoint :)
from RadioGPT.gptmodel.core.model import GPTlite
# load_model is defined below, so it is not imported from RadioGPT.gptmodel.core.utils
import os
import subprocess

import torch
# get device
device = (
    torch.device("mps") if torch.backends.mps.is_available() else
    torch.device("cuda") if torch.cuda.is_available() else
    torch.device("cpu")
)

# downloading model
print("Downloading model...")
os.makedirs("models", exist_ok=True)
subprocess.run(["wget", "https://openfileserver.chloelavrat.com/workshops/RadioGPT/models/model_gpt_chat_best.pth", "-O", "models/model_gpt_chat_best.pth"])

def load_model(model_path, device, config):
    # Load the model checkpoint
    checkpoint = torch.load(
        model_path, map_location=device, weights_only=False)

    model = GPTlite(config)
    model.load_state_dict(checkpoint)

    return model

# Hyperparameters
block_size = 64
n_embd = 512
n_head = 8
n_layer = 10
dropout = 0.2

config = {
    'context_size': block_size,
    'vocab_size': dataset.vocab_size,
    'embedding_dim': n_embd,
    'num_heads': n_head,
    'num_layers': n_layer,
    'dropout': dropout
}

# Loading model in memory
print("Loading model...")
model = load_model("models/model_gpt_chat_best.pth", device, config)
model = model.to(device)
print("Model loaded !")

# Get the total number of parameters in the model
total_params = sum(p.numel() for p in model.parameters())

print(f"You will use your {device}")
print(f"Total number of parameters   {total_params / 1e6:.1f}M")

Downloading model...
Loading model...
Model loaded !
You will use your cpu
Total number of parameters   83.1M


As the output above shows, this version of the model has far more parameters than the previous one: **83.1M** in total.
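
To see where that number comes from, here is a rough back-of-the-envelope estimate. It assumes a standard GPT-2-style layout (four attention projections, an MLP with a 4x hidden size, learned positional embeddings, and an output head that is not tied to the token embedding) and ignores biases and layer norms. The internals of `GPTlite` are not shown in this notebook, so these are assumptions, and `estimate_gpt_params` is a hypothetical helper.

```python
def estimate_gpt_params(vocab_size, context_size, embedding_dim, num_layers, tied_head=False):
    """Approximate weight count of a GPT-2-style decoder (biases and norms ignored)."""
    token_emb = vocab_size * embedding_dim       # token embedding table
    pos_emb = context_size * embedding_dim       # learned positional embeddings
    attention = 4 * embedding_dim ** 2           # query, key, value and output projections
    mlp = 8 * embedding_dim ** 2                 # two linear layers with 4x expansion
    blocks = num_layers * (attention + mlp)
    head = 0 if tied_head else vocab_size * embedding_dim  # output projection
    return token_emb + pos_emb + blocks + head


# Hyperparameters taken from the config cell above.
estimate = estimate_gpt_params(vocab_size=50257, context_size=64,
                               embedding_dim=512, num_layers=10)
print(f"{estimate / 1e6:.2f}M")
```

Under these assumptions the estimate lands close to the 83.1M reported by the notebook, and more than half of the weights sit in the two vocabulary-sized matrices rather than in the transformer blocks.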


# 🎨 Let's Play

In [6]:
#@title Selected prompt

# play with the model
def generate_response(model, dataset, prompt, device, max_new_tokens):
    model.eval()
    # Encode the prompt
    input_tensor = dataset.encode(prompt).unsqueeze(0).to(device)

    # Generate text using the model's generate method
    with torch.no_grad():
        generated_indices = model.generate(input_tensor, max_new_tokens)
        generated_text = dataset.decode(generated_indices[0].tolist())

    # Return only the newly generated part (after the prompt)
    return generated_text[len(prompt):]


prompt = 'Prépare une recette de pâtes à la carbonara.' # @param ["Prépare une recette de pâtes à la carbonara.", "Quel est l'élément chimique avec le numéro atomique 29 ?", "Rédige un court paragraphe sur le thème de l'amitié et de la confiance."]
max_new_tokens = 106 # @param {type:"slider", min:5, max:500, step:1}


prompt = f"Question: {prompt}\nAnswer:"
prompt = prompt + generate_response(model, dataset, prompt, device, max_new_tokens)
print(prompt)

Question: Prépare une recette de pâtes à la carbonara.
Answer: Ingrédients : - 400 g de spaghetti - 150 g de pancetta - 3 œufs - 100 g de parmesan râpé - 50 g de Sel et de poivante extraiens de la vanille indiffelques demain.
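
The objectives mention temperature, but the cell above does not expose it. As a minimal sketch of what temperature does during sampling, the pure-Python code below rescales logits before the softmax. The function names are hypothetical, and whether `GPTlite.generate` accepts a temperature argument is not shown in this notebook.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 10.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

A low temperature concentrates probability on the highest-scoring token, which makes the output more repetitive but more stable. A high temperature flattens the distribution, which makes a small model drift into nonsense faster.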




In [7]:
#@title Let's prompt it!

# generate_response is defined in the previous cell and reused here


prompt = 'Décris les différences entre le modèle GPT-2 et le modèle GPT-3.' # @param {type:"string"}
max_new_tokens = 180 # @param {type:"slider", min:5, max:500, step:1}


prompt = f"Question: {prompt}\nAnswer:"
prompt = prompt + generate_response(model, dataset, prompt, device, max_new_tokens)
print(prompt)

Question: Décris les différences entre le modèle GPT-2 et le modèle GPT-3.
Answer: Le modèle GPT-2 est un modèle de langage généré développé par OpenAI. Il est basse de divers actuellement long terme, car el climat a menaccent mis nombre de nombreux our men parcourives. As scientif� vol extra- nice On my
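
One factor that may contribute to this drift is the context window: the model was configured with `context_size = 64`, while the cell above asks for 180 new tokens. A common approach in small GPT implementations is to feed only the last `context_size` tokens at each step, so the original question eventually falls out of view. The sketch below shows that loop with a stand-in `next_token_fn`. It is an assumption about how such a loop typically works, not a description of `GPTlite.generate`.

```python
def generate_with_window(next_token_fn, prompt_ids, max_new_tokens, context_size):
    """Autoregressive loop that only shows the last context_size tokens to the model."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_size:]       # older tokens are dropped here
        ids.append(next_token_fn(window))  # predict one token and append it
    return ids


# Stand-in "model": returns how many tokens it was allowed to see.
def visible_length(window):
    return len(window)


result = generate_with_window(visible_length, [0, 0, 0],
                              max_new_tokens=5, context_size=4)
print(result)  # [0, 0, 0, 3, 4, 4, 4, 4]
```

Once the sequence is longer than the window, the stand-in model never sees more than four tokens, no matter how long the full sequence grows.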




# ✨ **Conclusion**

This exercise demonstrates both the potential and the limitations of small models. Despite having only **83.1M parameters**, the model can generate coherent sentence structures. However, as the example below shows, its responses often lack meaningful content and quickly break down into nonsensical phrases.

>**Example**  
>**Question:** *Décris les différences entre le modèle GPT-2 et le modèle GPT-3*  
>**Answer:** *Le modèle GPT-2 est un modèle de langage généré développé par OpenAI. Il est basse de pays ose ter voliance par le cad de pays membre longue de bell. Ass end quotid men endse animerie longues de nombreonde extrautes vert endale du musée nume numaraccent.*

The first sentence is grammatical and on topic, but the response then drifts into gibberish, which illustrates the trade-off between model size and capability. Larger models are often needed for accurate, context-aware generation because they have more capacity to represent complex relationships in the data.

In the next section, a fine-tuned LLM will be used to explore how training on domain-specific data can significantly improve the quality of responses. Let’s take this learning further! 🚀