# Data Preparation for Fine-Tuning a Transformer Model on Azure Databricks

## Overview
This notebook prepares and saves the datasets used to fine-tune a transformer model. The data is loaded, combined, and stored as Delta tables on Azure Databricks for subsequent machine learning workflows.

## Datasets
- **train_data.jsonl**: Training set to be used for model fine-tuning.
- **val_data.jsonl**: Validation set for model evaluation.
- **test_data.jsonl**: Test set for final assessment.

## Author
- Name: Alessandro Armillotta
- Date: 09/10/2025

## Steps
1. Load the Parquet datasets from Azure Databricks Volumes.
2. Combine training and test datasets and prepare the labels.
3. Save processed DataFrames as Delta tables for downstream fine-tuning tasks.

In [0]:
from pyspark.sql import functions as F
import pandas as pd

### Step 1: Load the Parquet datasets from Azure Databricks Volumes.

In [0]:
test_data = spark.read.parquet("/Volumes/main/fine_tuning_transformer_model/files/test.parquet")
train_data = spark.read.parquet("/Volumes/main/fine_tuning_transformer_model/files/train.parquet")

In [0]:
train_data.display()
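
Before combining the datasets, it can help to confirm that each one exposes the columns this notebook refers to later (`text` and `label`). The following is a minimal sketch in plain Python; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
from typing import Iterable, List

# Columns referenced later in this notebook; adjust if your schema differs.
REQUIRED_COLUMNS = ("text", "label")


def missing_columns(
    columns: Iterable[str],
    required: Iterable[str] = REQUIRED_COLUMNS,
) -> List[str]:
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]
```

In the notebook this could be called as `missing_columns(train_data.columns)` and `missing_columns(test_data.columns)`; an empty list means both columns are present.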

### Step 2: Combine training and test datasets and prepare the labels.

In [0]:
# Create the labels as integers.
# Some transformer models require integer labels.

union_df = train_data.unionAll(test_data)

The transform works on the id column, which can be either a string or an integer. Use the cells below if you need to convert string labels to integer ids.

In [0]:
# Collect the distinct labels (with their row counts) from the combined data
labels_df = union_df.groupBy("label").count()
labels = labels_df.collect()

# Build the id -> label and label -> id mappings
id2label = {index: row.label for (index, row) in enumerate(labels)}
label2id = {row.label: index for (index, row) in enumerate(labels)}
print(f"Number of labels: {len(labels)}")
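
The row order returned by `collect()` after a `groupBy` is not guaranteed, so ids assigned with `enumerate` may differ between runs. One way to get stable ids is to sort the distinct labels before numbering them. This is a sketch only: `build_label_maps` is a hypothetical helper, and the sample labels are invented for illustration.

```python
from typing import Dict, Iterable, Tuple


def build_label_maps(labels: Iterable[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build id2label and label2id from labels, sorted so ids are stable."""
    ordered = sorted(set(labels))  # deduplicate, then fix the order
    id2label = dict(enumerate(ordered))
    label2id = {label: index for index, label in id2label.items()}
    return id2label, label2id


# Illustrative labels only; in the notebook they would come from the collected rows.
id2label, label2id = build_label_maps(["positive", "negative", "neutral", "negative"])
print(id2label)  # {0: 'negative', 1: 'neutral', 2: 'positive'}
```

In the notebook, the input could be `[row.label for row in labels]`.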

In [0]:


# replace labels with ids
#@F.pandas_udf('integer')
#def replace_labels_with_ids(labels: pd.Series) -> pd.Series:
#  return labels.apply(lambda x: label2id[x])
#
#train_data = train_data.select(replace_labels_with_ids(train_data.label).alias('label_id')
#                      ,train_data.text
#                      ,train_data.label
#                      )
#
#test_data = test_data.select(replace_labels_with_ids(test_data.label).alias('label_id')
#                      ,test_data.text
#                      ,test_data.label
#                      )
#
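
The commented-out UDF above looks up each label in `label2id`, so a label that is absent from the mapping would raise a bare `KeyError` inside the UDF. The same lookup can be sketched in plain Python with an explicit check that reports every unknown label at once. `encode_labels` is a hypothetical helper, not part of the original notebook.

```python
from typing import Dict, List, Sequence


def encode_labels(labels: Sequence[str], label2id: Dict[str, int]) -> List[int]:
    """Map each label to its integer id, failing loudly on unknown labels."""
    unknown = sorted({label for label in labels if label not in label2id})
    if unknown:
        raise KeyError(f"Labels missing from label2id: {unknown}")
    return [label2id[label] for label in labels]
```

The check matters mostly when the mapping is built from one dataset and applied to another; here the mapping is built from the union of train and test, so both should be covered.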

### Step 3: Save processed DataFrames as Delta tables for downstream fine-tuning tasks.

In [0]:
# train dataset length
train_data.count()

In [0]:
# labels distribution
train_data.groupBy("label").count().display()

In [0]:
train_data.write.mode("overwrite").option("mergeSchema", "true").saveAsTable("main.fine_tuning_transformer_model.train_data")

In [0]:
# test dataset length
test_data.count()

In [0]:
# labels distribution
test_data.groupBy("label").count().display()

In [0]:
test_data.write.mode("overwrite").option("mergeSchema", "true").saveAsTable("main.fine_tuning_transformer_model.test_data")

In [0]:
labels_df.write.mode("overwrite").option("mergeSchema", "true").saveAsTable("main.fine_tuning_transformer_model.labels")
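
The `labels` table written above holds each label and its count, but not the integer ids from `label2id`. If the fine-tuning job needs the same mapping, one option is to store the id next to each label. The sketch below builds such rows in plain Python; `label_rows` is a hypothetical helper and is not part of the original notebook.

```python
from typing import Dict, List, Tuple


def label_rows(counts: Dict[str, int], label2id: Dict[str, int]) -> List[Tuple[int, str, int]]:
    """Return (label_id, label, count) tuples ordered by label_id."""
    return sorted((label2id[label], label, count) for label, count in counts.items())
```

In the notebook, `counts` could be built as `{row["label"]: row["count"] for row in labels}` (index access is used because `row.count` resolves to the tuple method on a Spark `Row`), and the resulting rows could then be passed to `spark.createDataFrame` with a matching schema before saving.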