# Hugging Face + LibTorch: Fine-tuning BERT in Pure C++

This notebook demonstrates a complete **pure C++ (LibTorch)** workflow for fine-tuning a BERT model.

## Workflow Overview

1. **Setup**: Install tools, authenticate, configure LibTorch
2. **Load Model**: Download or load a pre-trained BERT/DistilBERT model from HuggingFace
3. **Inference (Before)**: Test the model's generation before fine-tuning
4. **Load Dataset**: Download and prepare a training dataset
5. **Fine-tune Model**: Train the model on your dataset
6. **Inference (After)**: Test the model's generation after fine-tuning to see improvements
7. **Save Model**: Save the fine-tuned model checkpoint
8. **Upload to HuggingFace**: Upload your fine-tuned model back to the Hub

## What This Notebook Shows

- **Pure C++ workflow**: No Python dependencies for model training or inference
- **HuggingFace integration**: Download models, upload checkpoints using `curl` and `git`
- **Fine-tuning demonstration**: Train a BERT model on a custom dataset
- **Before/After comparison**: See how fine-tuning affects model behavior

## Important Notes

- BERT/DistilBERT are **bidirectional encoders**, not designed for autoregressive text generation
- The generation examples may produce `[UNK]` tokens - this is expected behavior
- For better text generation, consider using a decoder model (GPT-style)
- This notebook uses a **plain-text dataset** and a **tokenizer** implemented in C++
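
To make the last point concrete: `distilbert-base-uncased` ships a WordPiece vocabulary, and the core of WordPiece is a greedy longest-match split of each word. The sketch below is a generic illustration with a toy vocabulary; the function name `wordpiece` and the vocabulary are assumptions made for this example and are not the notebook's `TokenizerDecoder`. It also shows one way `[UNK]` can appear: a word that cannot be covered by known pieces maps to `[UNK]`.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-match-first WordPiece split of ONE already-lowercased word.
// `vocab` is a toy vocabulary for illustration, not the DistilBERT vocabulary.
std::vector<std::string> wordpiece(const std::string& word,
                                   const std::unordered_set<std::string>& vocab) {
    std::vector<std::string> pieces;
    std::size_t start = 0;
    while (start < word.size()) {
        std::size_t end = word.size();
        std::string match;
        // Try the longest remaining substring first, then shrink it.
        while (end > start) {
            std::string candidate = word.substr(start, end - start);
            if (start > 0) candidate = "##" + candidate;  // continuation piece
            if (vocab.count(candidate)) { match = candidate; break; }
            --end;
        }
        if (match.empty()) return {"[UNK]"};  // no piece fits: the whole word is unknown
        pieces.push_back(match);
        start = end;
    }
    return pieces;
}
```

With the toy vocabulary `{"play", "##ing", "fox"}`, `wordpiece("playing", vocab)` yields `play` and `##ing`, while `wordpiece("foxes", vocab)` yields `[UNK]` because no `##es` piece exists.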

## 1) Install Hugging Face tooling (pick one path)

### Path A (Python-based): `huggingface-cli`

> This path requires Python because `huggingface-cli` comes from the `huggingface_hub` Python package.

- `python -m pip install -U huggingface_hub`
- `conda install -n cpp-notebooks -c conda-forge -y huggingface_hub`
- `mamba install -n cpp-notebooks -c conda-forge -y huggingface_hub`

### Path B (no Python): `curl` + `git`

> This notebook supports a no-Python path using `curl` for downloads and `git` for uploads (both invoked from C++).

- Install on Ubuntu/WSL: `sudo apt-get update && sudo apt-get install -y curl git` (add `git-lfs` if you plan to upload weights in Section 9)

After installing, verify with the next cell.

In [None]:
#include <cstdlib>
#include <iostream>

// Prefer the no-Python path: curl is enough for downloads.
int rc_curl = std::system("curl --version");
std::cout << "curl rc=" << rc_curl << std::endl;

// Optional: git for uploads (no Python).
int rc_git = std::system("git --version");
std::cout << "git rc=" << rc_git << std::endl;

// Optional: git-lfs for large model files (recommended for weights).
int rc_lfs = std::system("git lfs version");
std::cout << "git-lfs rc=" << rc_lfs << std::endl;

// Optional: huggingface-cli (Python-based).
int rc_hf = std::system("huggingface-cli --help");
std::cout << "huggingface-cli rc=" << rc_hf << std::endl;
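
The `rc` values printed above are raw `std::system` return values. On POSIX systems that value is a wait status rather than the exit code itself, so a missing tool typically shows up as a non-zero number that is not the shell's `127`. A minimal sketch of decoding it follows; the helper names `run_and_get_exit_code` and `tool_available` are hypothetical, and the code assumes a POSIX platform (Linux or WSL).

```cpp
#include <cstdlib>
#include <sys/wait.h>  // POSIX only: WIFEXITED / WEXITSTATUS

// Runs a shell command and returns its exit code, or -1 if the command
// could not be launched or was terminated by a signal.
int run_and_get_exit_code(const char* command) {
    int status = std::system(command);
    if (status == -1) return -1;
    if (WIFEXITED(status)) return WEXITSTATUS(status);
    return -1;
}

// True when the probe command (e.g. "git --version") exits with code 0.
bool tool_available(const char* probe_command) {
    return run_and_get_exit_code(probe_command) == 0;
}
```

For example, `tool_available("git lfs version > /dev/null 2>&1")` gives a plain yes/no answer that later cells could branch on.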

## 2) Authenticate

There are two practical auth paths:

### Path A (Python-based): `huggingface-cli login` (optional)
- Requires `huggingface-cli` (from `huggingface_hub`) on PATH
- Useful if you want the CLI-based `download/upload` helpers

### Path B (no Python): token passed to `curl`/`git` helpers (recommended here)
- Set an environment variable: `HF_TOKEN`
- Pass it to:
  - `download_file_http(..., token)` for private downloads
  - `upload_file_git(..., token)` for uploads
- You do **not** need to run `huggingface-cli login` for this path

Notes:
- Avoid pasting tokens into notebooks if you plan to commit them.
- If you exported `HF_TOKEN` in a terminal after the notebook kernel started, restart the kernel so it can see it.
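
The lookup order described here (token file first, then the `HF_TOKEN` environment variable) can be sketched with the standard library alone. `load_token` below is a hypothetical helper written for illustration; it is not the implementation of `mcppfa::hf::read_token_file`, whose internals are not shown in this notebook.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Hypothetical helper (not part of mcppfa): returns the trimmed first line of
// `path`, else the trimmed value of environment variable `env_name`, else "".
std::string load_token(const std::string& path, const char* env_name) {
    auto trim = [](std::string s) {
        const char* ws = " \t\r\n";
        const auto first = s.find_first_not_of(ws);
        if (first == std::string::npos) return std::string();
        const auto last = s.find_last_not_of(ws);
        return s.substr(first, last - first + 1);
    };
    std::ifstream in(path);
    std::string line;
    if (in && std::getline(in, line)) {
        std::string token = trim(line);
        if (!token.empty()) return token;
    }
    if (const char* env = std::getenv(env_name)) return trim(env);
    return {};
}
```

Trimming matters in practice: a token file saved on Windows or with a trailing newline would otherwise carry `\r` or whitespace into the `Authorization` header.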

In [None]:
#include <iostream>
#include <string>
#include "include/mcppfa/huggingface.hpp"

// Read token from secrets.txt (raw token on first line).
// Do NOT commit secrets.txt; add it to .gitignore.

std::string token;
try {
    token = mcppfa::hf::read_token_file("secrets.txt");
    std::cout << "Loaded token from secrets.txt (length=" << token.size() << ")\n";
} catch (const std::exception& e) {
    std::cerr << "Could not read secrets.txt: " << e.what() << "\n";
    std::cerr << "Falling back to env var HF_TOKEN (if set for this kernel).\n";
}

try {
    auto res = mcppfa::hf::login(token, "HF_TOKEN");
    std::cout << "login rc=" << res.exit_code << std::endl;
} catch (const std::exception& e) {
    std::cerr << "Login skipped/failed: " << e.what() << std::endl;
}

## 3) Load Model and Tokenizer

This step downloads a pre-trained BERT/DistilBERT model from HuggingFace (or uses local files if they already exist).

The model will be loaded into memory and ready for inference and fine-tuning.

**Note**: If you've already downloaded the model in a previous run, the code detects the local files and reports them. It still calls `load_from_hf`, which is expected to reuse the files already on disk.
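
On the no-Python path, fetching one file from the Hub comes down to a `curl` call against the repository's `resolve` URL, with the token sent as a bearer header for private repos. The sketch below only builds the URL and the command line; `hf_resolve_url` and `curl_download_command` are hypothetical names for illustration and may differ from what `load_from_hf` does internally. The quoting is deliberately simple and assumes none of the values contain a single quote.

```cpp
#include <string>

// Hub URL for one file of a model repo at a given revision.
std::string hf_resolve_url(const std::string& repo_id,
                           const std::string& filename,
                           const std::string& revision = "main") {
    return "https://huggingface.co/" + repo_id + "/resolve/" + revision + "/" + filename;
}

// Hypothetical helper: builds (does not run) a curl command line.
// -f makes curl fail on HTTP errors, -L follows redirects.
std::string curl_download_command(const std::string& url,
                                  const std::string& out_path,
                                  const std::string& token = "") {
    std::string cmd = "curl -fL";
    if (!token.empty()) {
        cmd += " -H 'Authorization: Bearer " + token + "'";
    }
    cmd += " -o '" + out_path + "' '" + url + "'";
    return cmd;
}
```

Passing the resulting string to `std::system` would perform the download. Note that a token placed on a command line can be visible to other local users through the process list, so this form suits single-user machines best.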

In [None]:
// LibTorch dynamic linking for xcpp17/cling (required for <torch/torch.h>)
#pragma cling add_include_path("/home/warawreh/MakeCPPFunAgain/libtorch/include")
#pragma cling add_include_path("/home/warawreh/MakeCPPFunAgain/libtorch/include/torch/csrc/api/include")
#pragma cling add_library_path("/home/warawreh/MakeCPPFunAgain/libtorch/lib")
#pragma cling load("/home/warawreh/MakeCPPFunAgain/libtorch/lib/libc10.so")
#pragma cling load("/home/warawreh/MakeCPPFunAgain/libtorch/lib/libtorch_cpu.so")
#pragma cling load("/home/warawreh/MakeCPPFunAgain/libtorch/lib/libtorch.so")

#include <iostream>
#include <filesystem>
#include "include/mcppfa/bert_huggingface.hpp"
#include "include/mcppfa/tokenizer_decoder.hpp"

// Load BERT model and tokenizer from HuggingFace (or use local files if they exist)
const std::string repo_id = "distilbert/distilbert-base-uncased";
const std::string model_name = "distilbert-base-uncased";
const std::string local_dir = ".hf/" + model_name;
const std::string tokenizer_path = local_dir + "/tokenizer.json";
const std::string config_path = local_dir + "/config.json";
const std::string weights_path_safetensors = local_dir + "/model.safetensors";
const std::string weights_path_bin = local_dir + "/pytorch_model.bin";

// Check if local files exist
bool tokenizer_exists = std::filesystem::exists(tokenizer_path);
bool config_exists = std::filesystem::exists(config_path);
bool weights_exist = std::filesystem::exists(weights_path_safetensors) || 
                     std::filesystem::exists(weights_path_bin);

// Load tokenizer
mcppfa::hf::BERTTokenizerWrapper tokenizer;
if (tokenizer_exists) {
    // The wrapper is driven through load_from_hf in both cases, so the call
    // below still takes the download path; it is expected to reuse the local file.
    std::cout << "Tokenizer found locally at: " << tokenizer_path << std::endl;
    std::cout << "Note: Tokenizer file exists, but wrapper will still attempt download." << std::endl;
    std::cout << "      The download will likely reuse the existing file." << std::endl;
} else {
    std::cout << "Downloading tokenizer from HuggingFace..." << std::endl;
}
tokenizer.load_from_hf(repo_id, "", mcppfa::hf::RepoType::model, "main");
std::cout << "Tokenizer loaded from: " << tokenizer.tokenizer_path() << std::endl;

// Load BERT model
mcppfa::hf::BERTModelWrapper bert_model;
if (config_exists && weights_exist) {
    std::cout << "Model files found locally:" << std::endl;
    std::cout << "  Config: " << config_path << std::endl;
    // Prefer reporting the safetensors file when both weight formats are present.
    if (std::filesystem::exists(weights_path_safetensors)) {
        std::cout << "  Weights: " << weights_path_safetensors << std::endl;
    } else {
        std::cout << "  Weights: " << weights_path_bin << std::endl;
    }
    std::cout << "Note: Model files exist, but wrapper will still attempt download." << std::endl;
    std::cout << "      The download will likely reuse existing files." << std::endl;
} else {
    std::cout << "Downloading model from HuggingFace..." << std::endl;
}
// Same call in both cases: load_from_hf resolves the files for repo_id.
bert_model.load_from_hf(repo_id, "", mcppfa::hf::RepoType::model, "main");
std::cout << "BERT model loaded from: " << bert_model.weights_path() << std::endl;
std::cout << "Config loaded from: " << bert_model.config_path() << std::endl;


## 4) Inference Before Fine-tuning

Let's test the model's text generation capabilities **before** fine-tuning. This gives us a baseline to compare against after training.

**Note**: BERT/DistilBERT are bidirectional encoders, not designed for autoregressive generation. You may see `[UNK]` tokens in the output - this is expected behavior.


In [None]:

// ===== INFERENCE SETUP: eval mode + tokenizer (generation happens in the next cell) =====
std::cout << "\n=== Starting Inference ===" << std::endl;

// Set model to evaluation mode
if (bert_model.is_distilbert()) {
    bert_model.distilbert_model()->eval();
} else {
    bert_model.bert_model()->eval();
}
// Load tokenizer decoder for encoding/decoding (matches Python's tokenizer.encode/decode)
mcppfa::tokenizer::TokenizerDecoder tokenizer_decoder;
tokenizer_decoder.load_from_file(tokenizer.tokenizer_path());
std::cout << "Tokenizer loaded: " << tokenizer_decoder.vocab_size() << " tokens" << std::endl;


In [None]:
// ===== INFERENCE: Using the new predict() method (like Python transformers) =====
// The predict() method maintains internal state and generates tokens one at a time

std::string prompt = "The red fox";
std::cout << "Prompt: " << prompt << std::endl;

// Initialize the model's generation state with the prompt
// This is similar to Python: model.generate(tokenizer.encode(prompt), ...)
bert_model.reset(tokenizer_decoder, prompt);

// Display initial encoded tokens
std::cout << "Encoded input_ids: ";
for (auto id : bert_model.get_input_ids()) std::cout << id << ' ';
std::cout << std::endl;

// Generation parameters (like Python transformers)
const double temperature = 0.8;  // Lower temperature = more conservative
const int64_t top_k = 50;         // Only sample from top 50 tokens
const size_t max_tokens = 50;     // Maximum tokens to generate

std::cout << "\n=== Generating with predict() method ===" << std::endl;
std::cout << "Note: BERT is not designed for autoregressive generation." << std::endl;
std::cout << "Special tokens are automatically filtered (like transformers library).\n" << std::endl;

// Generate tokens using the predict() method
// Each call generates the next token and automatically updates internal state
for (size_t t = 0; t < max_tokens; ++t) {
    int64_t next_token = bert_model.predict(
        tokenizer_decoder,
        temperature,
        top_k,
        false  // not greedy, use sampling
    );
    
    // predict() returns -1 if generation should stop (e.g., [SEP] token or max length)
    if (next_token == -1) {
        std::cout << "Generation stopped (reached stopping condition)." << std::endl;
        break;
    }
    
    // Progress indicator
    if ((t + 1) % 10 == 0) {
        std::cout << "Generated " << (t + 1) << " tokens..." << std::endl;
    }
}

// Decode output (automatically skips special tokens)
std::string output_text = tokenizer_decoder.decode(bert_model.get_input_ids());

// Show the generated completion
std::cout << "\n----\nGenerated text:\n" << output_text << std::endl;

std::cout << "\nNote: Special tokens are automatically filtered during generation and decoding." << std::endl;
std::cout << "This matches how Python's transformers library handles special tokens." << std::endl;
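
The `temperature` and `top_k` arguments passed to `predict()` above correspond to standard top-k sampling with temperature scaling. How `predict()` implements this is not shown in the notebook, so the following is a generic, stdlib-only sketch of the technique rather than the library's code; `sample_top_k` is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Samples a token id from `logits` using temperature scaling and top-k filtering.
// Assumes logits is non-empty, temperature > 0 and top_k >= 1.
std::size_t sample_top_k(const std::vector<double>& logits, double temperature,
                         std::size_t top_k, std::mt19937& rng) {
    top_k = std::min(top_k, logits.size());
    // Indices of the top_k largest logits, best first.
    std::vector<std::size_t> idx(logits.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(top_k), idx.end(),
                      [&](std::size_t a, std::size_t b) { return logits[a] > logits[b]; });
    // Unnormalized softmax weights over the kept logits; subtracting the
    // maximum keeps exp() numerically stable.
    const double max_logit = logits[idx[0]];
    std::vector<double> weights(top_k);
    for (std::size_t i = 0; i < top_k; ++i) {
        weights[i] = std::exp((logits[idx[i]] - max_logit) / temperature);
    }
    // discrete_distribution normalizes the weights itself.
    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
    return idx[pick(rng)];
}
```

A lower temperature sharpens the distribution toward the highest logit, and `top_k = 1` reduces to greedy decoding.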

## 5) Load Dataset

Download and prepare the training dataset. We'll use the Tiny Shakespeare dataset as an example.

The fine-tuning cell in Section 6 expects the text at `.hf/tiny_shakespeare/input.txt` and tokenizes it there.
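
A download-if-missing step for the dataset could look like the sketch below. `ensure_file` and `read_text_file` are hypothetical helpers, and the notebook does not state where its copy of Tiny Shakespeare comes from; the URL in the usage note is one commonly used copy and is an assumption.

```cpp
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

// Downloads `url` to `out_path` with curl unless the file already exists.
// Returns true if the file is present afterwards.
bool ensure_file(const std::string& url, const std::string& out_path) {
    namespace fs = std::filesystem;
    if (fs::exists(out_path)) return true;  // reuse a previous download
    const fs::path parent = fs::path(out_path).parent_path();
    if (!parent.empty()) fs::create_directories(parent);
    // Simple quoting: assumes neither value contains a single quote.
    const std::string cmd = "curl -fL -o '" + out_path + "' '" + url + "'";
    return std::system(cmd.c_str()) == 0 && fs::exists(out_path);
}

// Reads a whole text file into memory (empty string if it cannot be opened).
std::string read_text_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}
```

Usage, under the URL assumption above: `ensure_file("https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt", ".hf/tiny_shakespeare/input.txt")`, followed by `read_text_file(".hf/tiny_shakespeare/input.txt")`.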


## 6) Fine-tune Model

Now we'll fine-tune the pre-trained model on our dataset. This adapts the model to the specific domain or task.

**Training parameters:**
- Learning rate: 5e-5 (standard for BERT fine-tuning)
- Optimizer: AdamW
- Batch size: 1 (sequence length: 64)
- Training steps: Configurable via `num_chunks` (10 chunks in the cell below)

Watch the loss decrease as the model learns!
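
For reference, the loss printed at each step is the mean cross-entropy over the positions of one chunk. Writing $z_{t,v}$ for the logit of vocabulary entry $v$ at position $t$, $y_t$ for the target token id, $T$ for the chunk length (64 here) and $V$ for the vocabulary size, and assuming the default mean reduction of `cross_entropy`:

```latex
\mathcal{L} \;=\; -\frac{1}{T} \sum_{t=1}^{T}
  \log \frac{\exp\!\left(z_{t,\,y_t}\right)}{\sum_{v=1}^{V} \exp\!\left(z_{t,\,v}\right)}
```

As a yardstick, a model that spreads probability uniformly over the vocabulary scores $\mathcal{L} = \ln V$; for the 30,522-entry `distilbert-base-uncased` vocabulary that is about 10.33.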


In [None]:
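// Fine-tune the model on the Shakespeare dataset (tiny demo: up to num_chunks steps)

#include <torch/torch.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>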

const std::string data_path = ".hf/tiny_shakespeare/input.txt";

// Open the dataset (already downloaded as data_path)
std::ifstream infile(data_path);
if (!infile.is_open()) {
    throw std::runtime_error("Failed to open input.txt for fine-tuning.");
}
std::string data((std::istreambuf_iterator<char>(infile)), std::istreambuf_iterator<char>());

// Tokenize the entire text
std::vector<int64_t> train_tokens = tokenizer_decoder.encode(data);

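// We'll train on short chunks with next-token targets (input shifted by one).
// Caveat: a bidirectional encoder can attend to the token it is asked to
// predict, so treat this objective as a mechanics demo only.
const int64_t chunk_length = 64;
const int64_t num_chunks = 10; // only a handful of steps, as a demonstration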

// Optimizer (very simple, AdamW + a small learning rate)
// Get parameters from the underlying model
torch::optim::AdamW optimizer(
    bert_model.is_distilbert() 
        ? bert_model.distilbert_model()->parameters()
        : bert_model.bert_model()->parameters(),
    torch::optim::AdamWOptions(5e-5)
);

// Set model to training mode
if (bert_model.is_distilbert()) {
    bert_model.distilbert_model()->train();
} else {
    bert_model.bert_model()->train();
}
for (int64_t c = 0; c < num_chunks; ++c) {
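    // Take the c-th window of chunk_length tokens; targets need one extra token,
    // so stop once start + chunk_length + 1 tokens are no longer available.
    int64_t start = c * chunk_length;
    if (start + chunk_length + 1 > static_cast<int64_t>(train_tokens.size())) break;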
    std::vector<int64_t> input_chunk(
        train_tokens.begin() + start,
        train_tokens.begin() + start + chunk_length
    );
    std::vector<int64_t> target_chunk(
        train_tokens.begin() + start + 1,
        train_tokens.begin() + start + chunk_length + 1
    );

    // Prepare tensors
    torch::Tensor x = torch::from_blob(input_chunk.data(), {1, chunk_length}, torch::kInt64).clone();
    torch::Tensor y = torch::from_blob(target_chunk.data(), {1, chunk_length}, torch::kInt64).clone();
    torch::Tensor attention_mask = torch::ones({1, chunk_length}, torch::kInt64);

    // Forward
    torch::Tensor logits;
    if (bert_model.is_distilbert()) {
        logits = bert_model.distilbert_model()->forward(x, attention_mask); // [1, chunk_length, vocab_size]
    } else {
        logits = bert_model.bert_model()->forward(x, attention_mask);
    }

    // Compute loss (Cross Entropy over every position in the sequence)
    logits = logits.view({-1, logits.size(-1)});
    y = y.view({-1});
    torch::Tensor loss = torch::nn::functional::cross_entropy(logits, y);

    // Backward
    optimizer.zero_grad();
    loss.backward();
    optimizer.step();

    std::cout << "Fine-tuning step " << (c+1) << " / " << num_chunks << ", loss: " << loss.item<float>() << std::endl;
}

// Set model back to evaluation mode
if (bert_model.is_distilbert()) {
    bert_model.distilbert_model()->eval();
} else {
    bert_model.bert_model()->eval();
}
std::cout << "Done fine-tuning (demo: up to " << num_chunks << " steps on Shakespeare)!" << std::endl;

// Save the model after training
try {
    // You may choose your output filename. Here: "fine_tuned_model.pt"
    torch::serialize::OutputArchive archive;
    if (bert_model.is_distilbert()) {
        bert_model.distilbert_model()->save(archive);
    } else {
        bert_model.bert_model()->save(archive);
    }
    archive.save_to("fine_tuned_model.pt");
    std::cout << "Model saved to fine_tuned_model.pt" << std::endl;
} catch (const c10::Error& e) {
    std::cerr << "Error saving the model: " << e.msg() << std::endl;
}


## 7) Inference After Fine-tuning

Now let's test the model again with the same prompt. **Compare with Section 4** to see how fine-tuning has changed the model's behavior.

**Note**: This cell decodes greedily (argmax) for up to 100 tokens, whereas Section 4 samples with `temperature = 0.8` and `top_k = 50` for up to 50 tokens. Part of any difference in output therefore comes from the decoding strategy, not only from fine-tuning.


In [None]:
// Encode the prompt "The red fox" (same prompt as Section 4).
// The *_after names avoid clashing with Section 4's variables, since some
// cling builds reject redefining a name from an earlier cell.
std::string prompt_after = "The red fox";
std::vector<int64_t> input_ids = tokenizer_decoder.encode(prompt_after);
// Display input tokens
std::cout << "Prompt: " << prompt_after << std::endl;
std::cout << "Encoded input_ids: ";
for (auto id : input_ids) std::cout << id << ' ';
std::cout << std::endl;

// Generate up to 100 new tokens (rudimentary greedy decoding, no sampling)
const size_t max_tokens_after = 100;
const int64_t SEP_TOKEN = 102;  // [SEP] token ID

for (size_t t = 0; t < max_tokens_after; ++t) {
    // Prepare input tensor from current input_ids [B=1, T=seq_len]
    int64_t seq_len = static_cast<int64_t>(input_ids.size());
    torch::Tensor input_tensor = torch::from_blob(
        input_ids.data(), 
        {1, seq_len}, 
        torch::TensorOptions().dtype(torch::kInt64)
    ).clone();
    
    // Create attention mask (1 for all tokens)
    torch::Tensor attention_mask = torch::ones(
        {1, seq_len}, 
        torch::TensorOptions().dtype(torch::kInt64)
    );

    // Forward pass to get logits (no gradients needed for inference)
    torch::NoGradGuard no_grad;
    torch::Tensor logits;
    if (bert_model.is_distilbert()) {
        logits = bert_model.distilbert_model()->forward(input_tensor, attention_mask);
    } else {
        logits = bert_model.bert_model()->forward(input_tensor, attention_mask);
    }
    // logits shape: [1, seq_len, vocab_size]

    // Get predicted token: argmax on the last position
    auto last_logits = logits[0][seq_len - 1];  // [vocab_size]
    int64_t next_token = torch::argmax(last_logits, -1).item<int64_t>();

    // Append the predicted token
    input_ids.push_back(next_token);

    // Stop if SEP token (end of sequence)
    if (next_token == SEP_TOKEN) break;
}

// Decode output
std::string output_text_after = tokenizer_decoder.decode(input_ids);

// Show the generated completion
std::cout << "----\nGenerated text:\n" << output_text_after << std::endl;


## 8) Save Fine-tuned Model

The fine-tuned checkpoint is written to disk at the end of the Section 6 cell, so you can load it later or upload it to HuggingFace.

The model is saved as a LibTorch archive (`fine_tuned_model.pt`) via `torch::serialize::OutputArchive`.


## 9) Upload Fine-tuned Model to HuggingFace

If you don’t want `huggingface-cli`, the simplest upload path is **git over HTTPS** (optionally with LFS for big files).

**Upload process:**
- Uses `git` over HTTPS (no Python required)
- Automatically handles large files with `git-lfs`
- Clones/pulls the repo, adds files, commits, and pushes

**Auth:** Uses token from `secrets.txt` or `HF_TOKEN` environment variable.

**Note:** Make sure you've run the fine-tuning cell in Section 6 first; it writes the checkpoint described in Section 8 (Save Fine-tuned Model), so `fine_tuned_model.pt` exists.
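
The clone, add, commit and push sequence listed above can be spelled out as plain shell commands. The sketch below only builds those commands and runs nothing; `git_upload_commands`, its parameters and the `hf_upload_repo` working directory are hypothetical and are not the internals of `mcppfa::hf::HubUploader`. It assumes the Hub's HTTPS remote accepts `user:token` credentials in the URL and that values contain no single quotes.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: returns the shell commands for a git-over-HTTPS upload.
// Nothing is executed here. The token is embedded in the clone URL, which git
// then stores in the clone's .git/config, so delete the clone afterwards.
std::vector<std::string> git_upload_commands(const std::string& user,
                                             const std::string& token,
                                             const std::string& repo_id,
                                             const std::string& local_file,
                                             const std::string& dest_name,
                                             const std::string& message,
                                             const std::string& workdir = "hf_upload_repo") {
    const std::string url = "https://" + user + ":" + token + "@huggingface.co/" + repo_id;
    const std::string in_repo = "git -C '" + workdir + "' ";
    return {
        "git clone '" + url + "' '" + workdir + "'",
        in_repo + "lfs install --local",          // enable LFS filters for this clone
        in_repo + "lfs track '*.pt'",             // route checkpoints through LFS
        "cp '" + local_file + "' '" + workdir + "/" + dest_name + "'",
        in_repo + "add .gitattributes '" + dest_name + "'",
        in_repo + "commit -m '" + message + "'",
        in_repo + "push",
    };
}
```

Each string could be passed to `std::system` in order, stopping at the first non-zero result.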

In [None]:
#include <cstdlib>
#include <cstdio>
#include <filesystem>
#include <iostream>
#include <stdexcept>
#include <string>
#include "include/mcppfa/huggingface.hpp"

// Upload the fine-tuned model to HuggingFace Hub
// Make sure you've run the Section 6 cell first: it saves fine_tuned_model.pt
// (the step described in Section 8, Save Fine-tuned Model).

const std::string fine_tuned_model_path = "fine_tuned_model.pt";

// Check if the fine-tuned model exists
if (!std::filesystem::exists(fine_tuned_model_path)) {
    throw std::runtime_error("Fine-tuned model not found: " + fine_tuned_model_path + 
                              "\nPlease run the Section 6 cell (it saves the model) first.");
}

std::cout << "Found fine-tuned model: " << fine_tuned_model_path << std::endl;
std::cout << "File size: " << std::filesystem::file_size(fine_tuned_model_path) << " bytes" << std::endl;

// Upload to a model repo you own (change this to your repo)
const std::string repo_id = "Warawreh/MCPPFA-demo-model";  // Update this to your HuggingFace username/repo

// Prefer secrets.txt; fall back to env var HF_TOKEN
std::string token;
try {
    token = mcppfa::hf::read_token_file("secrets.txt");
    std::cout << "Loaded token from secrets.txt (length=" << token.size() << ")\n";
} catch (const std::exception& e) {
    std::cerr << "Could not read secrets.txt: " << e.what() << "\n";
    if (const char* tok = std::getenv("HF_TOKEN")) {
        token = tok;
        std::cout << "Using token from HF_TOKEN environment variable\n";
    }
}

try {
    if (token.empty()) {
        std::cerr << "ERROR: Provide a token via secrets.txt or HF_TOKEN to upload." << std::endl;
        std::cerr << "Create secrets.txt with your HuggingFace token on the first line." << std::endl;
    } else {
        mcppfa::hf::GitUploadOptions opt;
        opt.use_lfs = true;  // Use git-lfs for large model files
        opt.stream_progress = true;
        opt.max_stream_bytes = 300000; // keep notebook output reasonable

        std::cout << "\n=== Uploading Fine-tuned Model ===" << std::endl;
        std::cout << "Repository: " << repo_id << std::endl;
        std::cout << "File: " << fine_tuned_model_path << std::endl;
        std::cout << "\nStarting upload (this may take a while for large models)...\n" << std::endl;

        // Upload the fine-tuned model checkpoint
        mcppfa::hf::HubUploader hub(repo_id, mcppfa::hf::RepoType::model, token, opt);
        hub.upload(fine_tuned_model_path, "fine_tuned_model.pt");
        
        auto log = hub.push("upload fine-tuned BERT model from C++");
        
        std::cout << "\n=== Upload Complete ===" << std::endl;
        std::cout << "Exit status: " << log.exit_status << std::endl;
        std::cout << "System return code: " << log.system_rc << std::endl;
        std::cout << "Total time: " << log.seconds_total << " seconds" << std::endl;
        std::cout << "\nUpload report:\n" << log.report << std::endl;
        
        if (log.exit_status == 0) {
            std::cout << "\n✓ Successfully uploaded fine-tuned model to HuggingFace Hub!" << std::endl;
            std::cout << "View it at: https://huggingface.co/" << repo_id << std::endl;
        } else {
            std::cerr << "\n✗ Upload may have failed. Check the report above for details." << std::endl;
        }
    }
} catch (const std::exception& e) {
    std::cerr << "Upload failed: " << e.what() << std::endl;
    std::cerr << "\nTroubleshooting:" << std::endl;
    std::cerr << "1. Make sure git-lfs is installed: sudo apt-get install -y git-lfs && git lfs install" << std::endl;
    std::cerr << "2. Verify your HuggingFace token is valid" << std::endl;
    std::cerr << "3. Check that the repository " << repo_id << " exists and you have write access" << std::endl;
}