<a href="https://colab.research.google.com/github/cravolux/bitnami/blob/main/codedutravail.lu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1 Install Essential Libraries

In [1]:
!pip install -U transformers datasets accelerate peft bitsandbytes trl wandb



2 Upload the Files

In [2]:
from google.colab import files

uploaded = files.upload()

# Placeholder: assumes your file is named 'meu_dataset.jsonl'.
# The training cell defines its own `file_names` list and does not use this variable.
file_name = 'meu_dataset.jsonl'

Saving lux_travail_jurisprudence_batch1.jsonl to lux_travail_jurisprudence_batch1 (1).jsonl
Saving lux_travail_jurisprudence_seed.jsonl to lux_travail_jurisprudence_seed (1).jsonl
Saving code_travail_qa_dataset_enriched.jsonl to code_travail_qa_dataset_enriched (1).jsonl
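
Before training, it can help to confirm that each uploaded file is valid JSONL and carries the `prompt` and `response` fields that the loading cell expects. The following is a minimal sketch using only the standard library; the helper name `check_jsonl` is hypothetical and not part of the original notebook.

```python
import json

def check_jsonl(path, required_keys=("prompt", "response")):
    """Count valid records and collect (line_number, reason) for problematic lines."""
    valid, problems = 0, []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((line_number, f"invalid JSON: {error.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((line_number, f"missing keys: {missing}"))
            else:
                valid += 1
    return valid, problems
```

In the notebook, this could be called once per uploaded file, for example `check_jsonl(name)` for each `name` in `uploaded`, printing the returned problems before moving on to the loading step.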


3 Load the JSONL Files




In [None]:
from datasets import load_dataset, concatenate_datasets, Value
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer
import torch
import os

print("1. Reloading and unifying the datasets...")

# --- STEPS 1 & 2: LOADING AND PREPARING THE DATA ---
file_names = [
    'lux_travail_jurisprudence_batch1.jsonl',
    'lux_travail_jurisprudence_seed.jsonl',
    'code_travail_qa_dataset_enriched.jsonl'
]

all_datasets = []
INSTRUCTION_COLUMN = 'prompt'
RESPONSE_COLUMN = 'response'

for file_name in file_names:
    try:
        dataset = load_dataset('json', data_files=file_name, split='train')
    except Exception as e:
        print(f"WARNING: Error while loading {file_name}. Make sure it is accessible. Error: {e}")
        continue

    # Keep only files that expose both expected columns; warn instead of skipping silently.
    if INSTRUCTION_COLUMN in dataset.column_names and RESPONSE_COLUMN in dataset.column_names:
        dataset = dataset.select_columns([INSTRUCTION_COLUMN, RESPONSE_COLUMN])
        # Cast both columns to string so all datasets share one schema and can be concatenated.
        dataset = dataset.cast_column(INSTRUCTION_COLUMN, Value('string'))
        dataset = dataset.cast_column(RESPONSE_COLUMN, Value('string'))
        all_datasets.append(dataset)
    else:
        print(f"WARNING: {file_name} skipped, missing '{INSTRUCTION_COLUMN}' or '{RESPONSE_COLUMN}' column.")

if all_datasets:
    combined_dataset = concatenate_datasets(all_datasets)
else:
    # Raise instead of calling exit(), which would shut down the notebook kernel.
    raise RuntimeError("No dataset was loaded. Stopping.")

def format_llm_instruction(example):
    prompt_template = f"### Instruction:\n{example[INSTRUCTION_COLUMN]}\n\n### Réponse:\n{example[RESPONSE_COLUMN]}"
    return {"text": prompt_template}

dataset_formatted = combined_dataset.map(
    format_llm_instruction,
    remove_columns=combined_dataset.column_names
)
dataset_split = dataset_formatted.train_test_split(test_size=0.1, seed=42)
print("   ✅ Dataset formatted and split. Variable 'dataset_split' defined.")

print("\n" + "="*80 + "\n")

# --- STEP 4: CONFIGURING AND STARTING THE TRAINING (FINE-TUNING) ---

OUTPUT_DIR = "./results_lux_law_llm_phi3"
model_id = "microsoft/Phi-3-mini-4k-instruct"

# 2. QLoRA configuration (4-bit NF4 quantization, bfloat16 compute)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# 3. Load the model and the tokenizer
print(f"2. Loading model {model_id} with QLoRA...")

torch.cuda.empty_cache()

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    # Recent transformers releases warn that `torch_dtype` is deprecated in favor of `dtype`.
    torch_dtype=torch.bfloat16,
)

model.config.use_cache = False
model = prepare_model_for_kbit_training(model)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# 4. Configure LoRA (fix applied)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    # New: target the attention projection layers of the Phi-3 model
    target_modules=["qkv_proj", "o_proj"],
)

# 5. Configure the training arguments (tuned for minimal memory use)
training_arguments = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="paged_adamw_32bit",
    save_steps=500,
    logging_steps=100,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,
    group_by_length=True,
    lr_scheduler_type="cosine",
)

# 6. Initialize the trainer (SFTTrainer)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_split['train'],
    eval_dataset=dataset_split['test'],
    peft_config=peft_config,
    args=training_arguments,
)

# 7. START TRAINING
print("\n🔥 STARTING TRAINING...")
trainer.train()

# 8. Save the final model (LoRA adapter weights) and the tokenizer
final_output_dir = "./final_lux_law_model_phi3"
trainer.model.save_pretrained(final_output_dir)
tokenizer.save_pretrained(final_output_dir)
print(f"\n✅ Training finished. Model saved to: {final_output_dir}")

1. Reloading and unifying the datasets...


Generating train split: 0 examples [00:00, ? examples/s]

Casting the dataset:   0%|          | 0/11 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

Casting the dataset:   0%|          | 0/5 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/5 [00:00<?, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

Casting the dataset:   0%|          | 0/6816 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/6816 [00:00<?, ? examples/s]

Map:   0%|          | 0/6832 [00:00<?, ? examples/s]

   ✅ Dataset formatted and split. Variable 'dataset_split' defined.


2. Loading model microsoft/Phi-3-mini-4k-instruct with QLoRA...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

Adding EOS to train dataset:   0%|          | 0/6148 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/6148 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (4935 > 4096). Running this sequence through the model will result in indexing errors
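
The warning above reports a 4935-token sequence against the model's 4096-token limit. One option is to set over-long examples aside before training, so that no answer is cut off mid-sentence. The sketch below is hedged: `split_by_length` and `count_tokens` are hypothetical names, and the whitespace counter in the demo only stands in for a real tokenizer.

```python
def split_by_length(texts, count_tokens, max_tokens=4096):
    """Separate texts that fit within max_tokens from those that exceed it."""
    kept, too_long = [], []
    for text in texts:
        (kept if count_tokens(text) <= max_tokens else too_long).append(text)
    return kept, too_long

# Demo with a crude whitespace counter (an assumption for illustration only;
# real token counts come from the model's tokenizer and are usually higher).
samples = ["short example", "word " * 10]
kept, too_long = split_by_length(samples, lambda t: len(t.split()), max_tokens=5)
print(len(kept), len(too_long))
```

With the notebook's tokenizer, the counter could be `lambda t: len(tokenizer(t)["input_ids"])`, applied to the `text` column of `dataset_formatted`.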


Truncating train dataset:   0%|          | 0/6148 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/684 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/684 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/684 [00:00<?, ? examples/s]
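
The progress output above shows 6148 training examples, and the training arguments use a per-device batch size of 1 with 8 gradient-accumulation steps. The arithmetic below sketches what that implies for one epoch. It assumes a single GPU, which the notebook does not state. Whether the final partial accumulation counts as an optimizer step depends on the library version, so both bounds are shown.

```python
import math

train_examples = 6148          # from the "Tokenizing train dataset" output
per_device_batch_size = 1      # per_device_train_batch_size
grad_accum_steps = 8           # gradient_accumulation_steps
num_devices = 1                # assumption: a single Colab GPU

# Examples consumed per optimizer update
effective_batch = per_device_batch_size * grad_accum_steps * num_devices
steps_floor = train_examples // effective_batch
steps_ceil = math.ceil(train_examples / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_floor} to {steps_ceil}")

logging_steps, save_steps = 100, 500
print(f"loss entries logged: {steps_floor // logging_steps}")
print(f"intermediate checkpoints: {steps_floor // save_steps}")
```

Under these assumptions, `logging_steps=100` yields about seven entries in the training-loss table and `save_steps=500` yields one intermediate checkpoint for the single epoch.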


🔥 STARTING TRAINING...


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: jose-cravo (jose-cravo-notion) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin




Step,Training Loss
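
The LoRA configuration in the training cell sets `r=64` and `lora_alpha=16` on the `qkv_proj` and `o_proj` modules. The arithmetic below sketches how many trainable parameters that adds and what scaling the low-rank update receives. The model dimensions (hidden size 3072, 32 layers, fused QKV output of 3 × 3072) are assumptions taken from the published Phi-3-mini configuration, not from this notebook.

```python
# Assumed Phi-3-mini dimensions (not stated in the notebook).
hidden_size = 3072
num_layers = 32
qkv_out = 3 * hidden_size      # fused query/key/value projection

r, lora_alpha = 64, 16         # values from the notebook's LoraConfig

def lora_params(in_features, out_features, rank):
    """Each adapted linear layer adds A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)

per_layer = lora_params(hidden_size, qkv_out, r) + lora_params(hidden_size, hidden_size, r)
total = per_layer * num_layers
scaling = lora_alpha / r       # standard LoRA scaling applied to the B @ A update

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling factor: {scaling}")
```

A scaling factor below 1 damps the adapter's contribution; raising `lora_alpha` or lowering `r` would change it, at the cost of a different trade-off between capacity and stability.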


Test

In [None]:
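# The first argument of trainer.push_to_hub is a commit message, not a repo name; push the adapter to the named repo instead.
trainer.model.push_to_hub("lux-law-phi3-mini-qlora-v1")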
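
To test the fine-tuned model, the inference prompt should follow the same template that `format_llm_instruction` used during training, stopping right after the response header so that the model writes the answer. The following is a minimal sketch; `build_prompt` and `extract_answer` are hypothetical helpers, not part of the original notebook, and the sample question is only illustrative.

```python
RESPONSE_HEADER = "### Réponse:\n"

def build_prompt(question):
    """Format a question with the training template, leaving the answer open."""
    return f"### Instruction:\n{question}\n\n{RESPONSE_HEADER}"

def extract_answer(generated_text):
    """Return the text after the first response header, or the whole text if absent."""
    _, found, answer = generated_text.partition(RESPONSE_HEADER)
    return answer.strip() if found else generated_text.strip()

# Hypothetical example question, for illustration only.
prompt = build_prompt("Quelle est la durée légale du travail ?")
print(prompt)
```

The resulting string can be tokenized and passed to `model.generate`, with `extract_answer` applied to the decoded output to isolate the model's reply.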