# Fine-Tuning Indonesian NLI Models on Kaggle

This notebook fine-tunes two models, IndoRoBERTa and IndoBERT, on the Indonesian NLI (IndoNLI) dataset. It is adapted to run on Kaggle and uses MLflow, TensorBoard, and DVC for experiment tracking.

## 1. Setup and Dependencies

In [None]:
!pip install --upgrade torch torchvision torchaudio mlflow transformers datasets scikit-learn accelerate evaluate tensorboard "dvc[gdrive]" Pillow

## 2. Imports and Configuration

In [None]:
import torch
import mlflow
import json
import os
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset
import evaluate
from kaggle_secrets import UserSecretsClient

### Configuration Parameters
Set `USE_SMALL_SUBSET` to `True` for a quick test run on a small slice of each split. Set `NUM_TRAIN_EPOCHS` to the number of training epochs.

In [None]:
USE_SMALL_SUBSET = True
NUM_TRAIN_EPOCHS = 1

## 3. Setup Credentials with Kaggle Secrets
Before running the next cell, you need to add your secrets to Kaggle:
1. **`GIT_TOKEN`**: Your GitHub Personal Access Token with repo access.
2. **`GDRIVE_CREDENTIALS_DATA`**: The content of your `gdrive-credentials.json` file.

In [None]:
user_secrets = UserSecretsClient()
GIT_TOKEN = user_secrets.get_secret("GIT_TOKEN")
GDRIVE_CREDS = user_secrets.get_secret("GDRIVE_CREDENTIALS_DATA")

# Configure Git
os.system("git config --global user.email 'your_email@example.com'")
os.system("git config --global user.name 'fabhiansan'")

# Write the DVC credentials outside the repository clone so `git add .` cannot stage them
with open("/kaggle/working/gdrive-credentials.json", "w") as f:
    f.write(GDRIVE_CREDS)
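
The credentials file written above holds a secret, so it may be worth restricting its permissions. The following is a minimal sketch using only the standard library; `write_secret_file` is a hypothetical helper name, not part of the original notebook.

```python
import os


def write_secret_file(path, content):
    """Write `content` to `path` so that only the file owner can read or write it."""
    # Mode 0o600 applies when the file is created; O_TRUNC overwrites an older copy.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(content)
    # The mode passed to os.open is ignored if the file already existed,
    # so set the permissions explicitly as well.
    os.chmod(path, 0o600)
```

It could be called with the same path and the `GDRIVE_CREDS` value used in the cell above, in place of the plain `open(...)` block.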

## 4. Clone Repository

In [None]:
!git clone https://fabhiansan:{GIT_TOKEN}@github.com/fabhiansan/logging_experiment.git
%cd logging_experiment
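
Because the clone URL embeds `GIT_TOKEN`, the token may end up in saved cell output if a Git command echoes the URL. One possible safeguard is to run Git through a wrapper that masks the secret before anything is printed. The sketch below assumes a hypothetical helper named `run_redacted`; it is not part of the original notebook.

```python
import subprocess


def _mask(text, secret):
    """Replace every occurrence of `secret` in `text` with ***."""
    return text.replace(secret, "***") if secret else text


def run_redacted(cmd, secret):
    """Run `cmd` (a list of arguments) and return its combined output with `secret` masked.

    Raises RuntimeError, also with the secret masked, if the command exits non-zero.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = _mask(result.stdout + result.stderr, secret)
    if result.returncode != 0:
        raise RuntimeError(
            f"command failed with exit code {result.returncode}: {output}"
        )
    return output
```

For example, `print(run_redacted(["git", "clone", url], GIT_TOKEN))`, where `url` is the authenticated clone URL built from `GIT_TOKEN`.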

## 5. Data Loading and Preprocessing

In [None]:
print("Loading IndoNLI dataset...")
dataset = load_dataset("afaji/indonli", trust_remote_code=True)

if USE_SMALL_SUBSET:
    print("Using a small subset of the data for a quick run.")
    # Trim every split that exists instead of hard-coding split names
    # (IndoNLI ships test_lay and test_expert rather than a single test split).
    for split in list(dataset):
        limit = 100 if split == "train" else 50
        dataset[split] = dataset[split].select(range(min(limit, len(dataset[split]))))

print("Dataset loaded.")
print(dataset)

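def preprocess_function(examples, tokenizer, max_length=512):
    # Tokenize premise/hypothesis pairs. An explicit max_length keeps padding bounded
    # even if the tokenizer config does not define model_max_length.
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True, padding="max_length", max_length=max_length)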

## 6. Model Fine-Tuning
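
The cell in this section calls `run_training()`, which is defined in a previous notebook and not reproduced here. The following is a minimal sketch of what such a function might look like, not the author's original. Everything in it is an assumption: the `MODEL_CHECKPOINTS` placeholders (the notebook does not name the checkpoints), the helpers `compute_metrics` and `save_metrics`, the `results/` output directory, the three-label setup, and the function signature. The `metrics/*_metrics.json` file names and the `runs` log directory are chosen to match the tracking steps later in this notebook. Argument names follow the `transformers` 4.x `Trainer` API and may need adjusting for other versions.

```python
import json
import os

# Hypothetical placeholders: replace each value with the Hugging Face model ID
# you intend to fine-tune. The notebook does not say which checkpoints it uses.
MODEL_CHECKPOINTS = {
    "roberta": "<indoroberta-checkpoint>",
    "bert": "<indobert-checkpoint>",
}


def compute_metrics(eval_pred):
    """Plain accuracy from (logits, labels); accepts lists or NumPy arrays."""
    logits, labels = eval_pred
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(preds, labels))
    return {"accuracy": correct / max(len(preds), 1)}


def save_metrics(name, metrics, out_dir="metrics"):
    """Write metrics to <out_dir>/<name>_metrics.json and return the file path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{name}_metrics.json")
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    return path


def run_training(dataset, checkpoints=MODEL_CHECKPOINTS, num_epochs=1, max_length=512):
    """Fine-tune each checkpoint on `dataset` and save its validation metrics."""
    # Heavy dependencies are imported here so the helpers above work without them.
    import mlflow
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    for name, checkpoint in checkpoints.items():
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(
            checkpoint, num_labels=3  # assumed: entailment, neutral, contradiction
        )
        encoded = dataset.map(
            lambda batch: tokenizer(
                batch["premise"], batch["hypothesis"],
                truncation=True, padding="max_length", max_length=max_length,
            ),
            batched=True,
        )
        args = TrainingArguments(
            output_dir=f"results/{name}",
            num_train_epochs=num_epochs,
            logging_dir=f"runs/{name}",  # picked up by `%tensorboard --logdir runs`
            report_to=["tensorboard", "mlflow"],
        )
        trainer = Trainer(
            model=model,
            args=args,
            train_dataset=encoded["train"],
            eval_dataset=encoded["validation"],
            compute_metrics=compute_metrics,
        )
        with mlflow.start_run(run_name=name):
            trainer.train()
            metrics = trainer.evaluate()
        save_metrics(name, metrics)
```

With this sketch the call would be `run_training(dataset, num_epochs=NUM_TRAIN_EPOCHS)`; the original function's signature may differ.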

In [None]:
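# run_training() is the training function from the previous notebook. It is not
# defined here, so define or paste it in a cell above before running this one,
# and adjust the call if your definition takes arguments.
run_training()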

## 7. Experiment Tracking

### TensorBoard

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs

### MLflow
Training writes its MLflow results to the `mlruns` directory. Zip it for easy download, then unzip it on your local machine and run `mlflow ui` from the directory that contains `mlruns` to view the results.

In [None]:
!zip -r /kaggle/working/mlruns.zip mlruns

### DVC and Git
Track metrics with DVC and push all changes to GitHub.

In [None]:
!dvc init -f
# Replace <your_gdrive_folder_id> with your actual Google Drive folder ID
!dvc remote add -d gdrive gdrive://<your_gdrive_folder_id>
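# Assumes the JSON holds OAuth user credentials; for a service account key, set
# gdrive_use_service_account true and gdrive_service_account_json_file_path instead.
!dvc remote modify gdrive gdrive_user_credentials_file /kaggle/working/gdrive-credentials.json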

!dvc add metrics/roberta_metrics.json metrics/bert_metrics.json
!dvc push

!git add .
!git commit -m "Kaggle experiment run with metrics"
!git push https://fabhiansan:{GIT_TOKEN}@github.com/fabhiansan/logging_experiment.git main