# Fine-tuning GPT-2 to recognize French

This AI chatbot project is meant to be integrated into my portfolio and made available to future visitors. The dataset was generated by ChatGPT and should not be taken as a description of my actual work; it is only an attempt to fine-tune GPT-2 for the French language.

Sources:

- [Fine tuning gpt2](https://www.youtube.com/watch?v=elUCn_TFdQc)
- [gpt2](https://huggingface.co/gpt2)

In [None]:
import json

import torch
import tqdm
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel

### Data preparation

In [124]:
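class ChatData(Dataset):
    """Prompt/response strings built from consecutive utterances of the chat log."""

    def __init__(self, path: str, tokenizer):
        # Read the file as UTF-8 so accented French characters decode correctly,
        # and close the file handle once the JSON is loaded.
        with open(path, "r", encoding="utf-8") as f:
            self.data = json.load(f)

        # Flatten every dialog into a single list of utterances.
        utterances = []
        for conversation in self.data:
            for turn in conversation['dialog']:
                utterances.append(turn['text'])

        # Pair each utterance with the one that follows it. The last utterance
        # has no reply, so it produces no example. Pairs are built over the
        # flattened list, so they can span dialog boundaries.
        self.X = [
            "<startofstring> " + utterances[idx] + " <bot>: " + utterances[idx + 1] + " <endofstring>"
            for idx in range(len(utterances) - 1)
        ]

        # Keep at most 3000 training examples.
        self.X = self.X[:3000]

        print(self.X[0])

        # Pad or truncate every example to exactly 40 tokens.
        self.X_encoded = tokenizer(self.X, max_length=40, truncation=True, padding="max_length", return_tensors="pt")
        self.input_ids = self.X_encoded['input_ids']
        self.attention_mask = self.X_encoded['attention_mask']

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return (self.input_ids[idx], self.attention_mask[idx])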
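The pairing step inside `ChatData.__init__` can be looked at in isolation. The sketch below is a standalone version of that idea; `build_pairs` is a hypothetical helper name that does not exist in this notebook, and it assumes the utterances have already been flattened into a list of strings.

```python
def build_pairs(utterances):
    """Join each utterance with the one that follows it.

    Hypothetical helper: the last utterance has no reply and yields no pair.
    """
    pairs = []
    for prompt, reply in zip(utterances, utterances[1:]):
        pairs.append(f"<startofstring> {prompt} <bot>: {reply} <endofstring>")
    return pairs


example = build_pairs(["Bonjour !", "Bonjour, bienvenue.", "Merci."])
for line in example:
    print(line)
```

If the dialogs alternate between visitor and bot, every second pair built this way has the bot's message as the prompt, which may or may not be what the training data should contain.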

### Setup

In [125]:

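tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Register the padding and sequence-delimiter tokens used by ChatData, plus the "<bot>:" separator.
tokenizer.add_special_tokens({"pad_token": "<pad>", "bos_token": "<startofstring>", "eos_token": "<endofstring>"})
tokenizer.add_tokens(["<bot>:"])
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Grow the embedding matrix so the newly added tokens get their own rows.
model.resize_token_embeddings(len(tokenizer))

# Prefer CUDA, then Apple's MPS backend, then fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
model = model.to(device)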


In [126]:
# tokenizer.decode(model.generate(**tokenizer('hey i was good at basketball but ', return_tensors='pt'))[0])

In [127]:
chatData = ChatData("chat_data.json", tokenizer)
chatData =  DataLoader(chatData, batch_size=64)

<startofstring> Bonjour ! Je visite votre site portfolio et je suis intéressé par vos projets. Pouvez-vous me montrer quelques-uns de vos travaux récents? <bot>: Bonjour ! Bien sûr, je serais ravi de vous montrer mes projets. Voici l'un de mes projets récents : [Nom du Projet 1]. Il s'agit d'un site web interactif pour une entreprise locale. Que pensez-vous de ce projet? <endofstring>


In [128]:
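def infer(inp):
    # Wrap the prompt in the same template used for the training examples.
    inp = "<startofstring> "+inp+" <bot>: "
    inp = tokenizer(inp, return_tensors="pt")
    X = inp["input_ids"].to(device)
    a = inp["attention_mask"].to(device)
    # generate() runs with its default settings here, so padding falls back to
    # GPT-2's original end-of-text id (50256), as the warnings in the training log show.
    output = model.generate(X, attention_mask=a)
    output = tokenizer.decode(output[0])
    return output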
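`infer` returns the full decoded string, prompt and markers included. A minimal sketch of pulling out only the generated reply follows; `extract_reply` is a hypothetical helper, and the marker strings are assumed to be the ones used in this notebook.

```python
def extract_reply(decoded, bot_marker="<bot>:", stop_tokens=("<endofstring>", "<pad>")):
    """Return only the text generated after the bot marker.

    Hypothetical helper; the marker strings mirror the ones used in this notebook.
    """
    # Keep what follows the first bot marker (or the whole string if it is missing).
    _, found, reply = decoded.partition(bot_marker)
    if not found:
        reply = decoded
    # Cut at the earliest stop token, if any appears.
    cut = len(reply)
    for token in stop_tokens:
        position = reply.find(token)
        if position != -1:
            cut = min(cut, position)
    return reply[:cut].strip()


sample = "<startofstring>Bonjour! <bot>: Bonjour, bienvenue ! <endofstring><pad><pad>"
print(extract_reply(sample))
```

With the notebook's function this would be called as `extract_reply(infer("Bonjour !"))`.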


In [129]:
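def train(chatData, model, optim):

    # 12 epochs, matching the progress bar of the run logged in the training output.
    epochs = 12

    for i in tqdm.tqdm(range(epochs)):
        for X, a in chatData:
            X = X.to(device)
            a = a.to(device)
            optim.zero_grad()
            # Causal LM objective: the inputs double as labels (the model shifts them internally).
            loss = model(X, attention_mask=a, labels=X).loss
            loss.backward()
            optim.step()
        # Checkpoint the weights and print a sample reply after every epoch.
        torch.save(model.state_dict(), "model_state.pt")
        print(infer("Bonjour !"))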
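Passing `labels=X` means the padded positions also contribute to the loss, so the model is partly trained to predict `<pad>`. The loss used by Hugging Face causal language models ignores positions whose label is `-100`, so one option is to mask the padding in the labels. The sketch below shows that idea on small hand-written tensors; `mask_padding_labels` is a hypothetical helper, not something this notebook defines.

```python
import torch


def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids and replace padded positions with ignore_index."""
    labels = input_ids.clone()
    labels[attention_mask == 0] = ignore_index
    return labels


# Two toy sequences padded to length 5 (0 stands in for the pad token id).
input_ids = torch.tensor([[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]])
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

In the training loop this would replace the label argument, for example `model(X, attention_mask=a, labels=mask_padding_labels(X, a)).loss`.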

### Training

In [130]:
model.train()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50261, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50261, bias=False)
)
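The `50261` in the embedding and output shapes above follows from the tokenizer setup: GPT-2's pretrained vocabulary has 50257 entries, and four tokens were added before `resize_token_embeddings` was called. A small check of that arithmetic, with illustrative variable names:

```python
# GPT-2's pretrained vocabulary size.
BASE_VOCAB_SIZE = 50257

# Tokens registered in the setup cell via add_special_tokens and add_tokens.
added_tokens = ["<pad>", "<startofstring>", "<endofstring>", "<bot>:"]

expected_vocab_size = BASE_VOCAB_SIZE + len(added_tokens)
print(expected_vocab_size)  # prints 50261
```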

In [131]:
optim = Adam(model.parameters(), lr=1e-3)

print("training .... ")
train(chatData, model, optim)

training .... 


  0%|          | 0/12 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
  8%|▊         | 1/12 [00:27<04:59, 27.23s/it]

<startofstring>Bonjour! <bot>: <pad><pad><pad>ClockWallClock<pad><pad>ClockWall<pad><pad>Clock<pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 17%|█▋        | 2/12 [00:37<02:53, 17.31s/it]

<startofstring>Bonjour! <bot>: -B


B-S



B



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 25%|██▌       | 3/12 [00:46<02:01, 13.54s/it]

<startofstring>Bonjour! <bot>: ------<startofstring>----<startofstring>-<startofstring>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 33%|███▎      | 4/12 [00:55<01:34, 11.80s/it]

<startofstring>Bonjour! <bot>: <startofstring><startofstring><startofstring><startofstring><startofstring>-<startofstring><startofstring><startofstring><startofstring><startofstring><startofstring><startofstring><startofstring>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 42%|████▏     | 5/12 [01:04<01:15, 10.78s/it]

<startofstring>Bonjour! <bot>:..............


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 50%|█████     | 6/12 [01:13<01:00, 10.15s/it]

<startofstring>Bonjour! <bot>: is.............


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 58%|█████▊    | 7/12 [01:22<00:48,  9.77s/it]

<startofstring>Bonjour! <bot>: isisisisis.isisisis.isisis


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 67%|██████▋   | 8/12 [01:31<00:38,  9.56s/it]

<startofstring>Bonjour! <bot>:..isisisisisisisisisisis.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 75%|███████▌  | 9/12 [01:40<00:28,  9.36s/it]

<startofstring>Bonjour! <bot>:..............


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 83%|████████▎ | 10/12 [01:49<00:18,  9.21s/it]

<startofstring>Bonjour! <bot>: is....'j..'''''


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 92%|█████████▏| 11/12 [01:58<00:09,  9.18s/it]

<startofstring>Bonjour! <bot>: jjjj..........


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
100%|██████████| 12/12 [02:07<00:00, 10.65s/it]

<startofstring>Bonjour! <bot>:.'''''''.'.'''
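The samples above collapse into repeated punctuation instead of improving. One plausible factor, which this notebook does not confirm, is the learning rate: `1e-3` is large for fine-tuning a pretrained transformer, where values around `5e-5` are more common. The sketch below builds an optimizer that way on a tiny stand-in module so it runs without downloading GPT-2; `make_optimizer` and its `5e-5` default are assumptions, not settings taken from this project.

```python
import torch
from torch import nn
from torch.optim import AdamW


def make_optimizer(model, lr=5e-5):
    """Build an AdamW optimizer with a conservative fine-tuning learning rate (assumed value)."""
    return AdamW(model.parameters(), lr=lr)


# Tiny stand-in for the real model, only so the sketch is runnable on its own.
stand_in = nn.Linear(4, 2)
optimizer = make_optimizer(stand_in)
print(optimizer.param_groups[0]["lr"])
```

With the real model the call would be `make_optimizer(model)`, passed to `train` in place of the `Adam` instance.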





In [134]:
print("infer from model : ")
while True:
  inp = input()
  print(infer(inp))

infer from model : 


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


<startofstring>bonjour! <bot>:. de de de de de de de de de de de de de


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


<startofstring>Bonjour! <bot>: -©'''.......'.


KeyboardInterrupt: Interrupted by user
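The inference loop above only stops when the kernel is interrupted. A minimal sketch of a loop with an explicit exit follows; `chat_loop` is a hypothetical wrapper, and `reply_fn` stands in for the notebook's `infer`. The `read` and `write` parameters exist only so the loop can be exercised with scripted input.

```python
def chat_loop(reply_fn, read=input, write=print, quit_words=("quit", "exit")):
    """Read prompts until the user quits, sends an empty line, or input ends.

    Hypothetical wrapper: reply_fn stands in for the notebook's infer().
    Returns the number of prompts answered.
    """
    answered = 0
    while True:
        try:
            prompt = read().strip()
        except (EOFError, KeyboardInterrupt):
            break
        if not prompt or prompt.lower() in quit_words:
            break
        write(reply_fn(prompt))
        answered += 1
    return answered


# Demo with scripted input instead of a live keyboard.
scripted = iter(["Bonjour !", "Merci", "quit", "never read"])
outputs = []
count = chat_loop(lambda text: f"echo: {text}", read=lambda: next(scripted), write=outputs.append)
print(count, outputs)
```

In the notebook this would be started with `chat_loop(infer)`.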