<a href="https://colab.research.google.com/github/atharvajoshi10/FederatedML/blob/main/Federated_Roberta_Averaging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Base Model

This notebook attempts to reproduce the validation F1 of the following model using federated learning:
https://huggingface.co/JeremiahZ/roberta-base-mrpc

# Dependencies

GPU: T4

Note: Force-update `accelerate`, because the environment otherwise defaults to an older version.
Restart the notebook runtime once the cell below has executed.

In [1]:
%%capture
!pip install transformers
!pip install evaluate
!pip install datasets
!pip install accelerate -U

In [2]:
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, AutoConfig, TrainingArguments, DataCollatorWithPadding
import numpy as np
import evaluate

# Defining Training Functions
- Import dataset
- Import Base Model and tokenizer
- Define Training Hyperparameters
- Train Model

The paths below point to the sharded training and validation splits and to the directory holding the saved models.

In [3]:
train_path = "/content/drive/MyDrive/FML Data/Federated Split/train/"
validation_path = "/content/drive/MyDrive/FML Data/Federated Split/validation/"
model_save_path = "/content/drive/MyDrive/FML Data/Federated Models/estimate_trainer/"
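
The notebook later loads one checkpoint per model by appending the model index to `model_save_path`. As a minimal sketch of that layout, the hypothetical helper `client_model_dirs` below builds the same directory names; `PurePosixPath` is used only so the example does not touch the file system.

```python
from pathlib import PurePosixPath

model_save_path = "/content/drive/MyDrive/FML Data/Federated Models/estimate_trainer/"

def client_model_dirs(base, num_models):
    """Return one checkpoint directory per model: <base>/0, <base>/1, ..."""
    base = PurePosixPath(base)
    return [str(base / str(i)) for i in range(num_models)]

dirs = client_model_dirs(model_save_path, 8)
print(dirs[0])
print(len(dirs))
```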

In [4]:
from transformers import RobertaForSequenceClassification, Trainer
import torch

In [5]:
num_models = 8

# Load the saved models; each one lives in a sub-directory named after its index (0..7).
models = []
for model_number in range(num_models):
  models.append(RobertaForSequenceClassification.from_pretrained(model_save_path + str(model_number)))

In [6]:
# Initialise the federated model from roberta-base, then zero the classifier head
# so that it can act as an accumulator for the saved models' head weights.
federated_model = RobertaForSequenceClassification.from_pretrained("roberta-base")

federated_model.classifier.dense.bias.data = torch.zeros_like(federated_model.classifier.dense.bias.data)
federated_model.classifier.dense.weight.data = torch.zeros_like(federated_model.classifier.dense.weight.data)

federated_model.classifier.out_proj.bias.data = torch.zeros_like(federated_model.classifier.out_proj.bias.data)
federated_model.classifier.out_proj.weight.data = torch.zeros_like(federated_model.classifier.out_proj.weight.data)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
federated_model.classifier.out_proj.weight.data

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

In [8]:
# Accumulate the classifier-head weights of every saved model into the federated model.
# Note: only the classifier head is summed; the encoder keeps the roberta-base weights.
for model in models:
  federated_model.classifier.dense.bias.data = torch.add(federated_model.classifier.dense.bias.data, model.classifier.dense.bias.data)
  federated_model.classifier.dense.weight.data = torch.add(federated_model.classifier.dense.weight.data, model.classifier.dense.weight.data)

  federated_model.classifier.out_proj.bias.data = torch.add(federated_model.classifier.out_proj.bias.data, model.classifier.out_proj.bias.data)
  federated_model.classifier.out_proj.weight.data = torch.add(federated_model.classifier.out_proj.weight.data, model.classifier.out_proj.weight.data)

In [9]:
federated_model.classifier.out_proj.weight.data

tensor([[-0.0649, -0.2258, -0.0561,  ..., -0.2208, -0.1192, -0.0907],
        [-0.0979,  0.0809,  0.1546,  ...,  0.0016,  0.1148,  0.0969]])

In [10]:
# Divide the summed head weights by the number of models to obtain their mean.
federated_model.classifier.dense.bias.data = torch.div(federated_model.classifier.dense.bias.data, num_models)
federated_model.classifier.dense.weight.data = torch.div(federated_model.classifier.dense.weight.data, num_models)
federated_model.classifier.out_proj.bias.data = torch.div(federated_model.classifier.out_proj.bias.data, num_models)
federated_model.classifier.out_proj.weight.data = torch.div(federated_model.classifier.out_proj.weight.data, num_models)
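
The sum-then-divide steps above handle the four classifier-head tensors by name. A more general sketch, assuming all models share one architecture, averages every floating-point entry of the `state_dict`. Unlike the cells above, this would also average the encoder, which the notebook leaves at the `roberta-base` weights. The function name `average_state_dicts` is hypothetical, and the tiny `nn.Linear` modules are stand-ins for the fine-tuned RoBERTa models.

```python
import torch
from torch import nn

def average_state_dicts(models):
    """Return a state_dict with the element-wise mean of the models' floating-point tensors."""
    state_dicts = [m.state_dict() for m in models]
    averaged = {}
    for name, first in state_dicts[0].items():
        if first.is_floating_point():
            stacked = torch.stack([sd[name] for sd in state_dicts])
            averaged[name] = stacked.mean(dim=0)
        else:
            # Integer buffers, if any, are copied from the first model rather than averaged.
            averaged[name] = first.clone()
    return averaged

# Toy stand-ins for the separately trained models (assumption for illustration).
torch.manual_seed(0)
clients = [nn.Linear(4, 2) for _ in range(8)]

federated = nn.Linear(4, 2)
federated.load_state_dict(average_state_dicts(clients))

# Compare against an explicit sum followed by a division, as in the notebook cells.
expected = sum(c.weight.data for c in clients) / len(clients)
print(torch.allclose(federated.weight.data, expected))
```

Working on the whole `state_dict` avoids listing parameter names by hand, at the cost of requiring every model to expose identical keys and shapes.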

In [11]:
federated_model.classifier.out_proj.weight.data

tensor([[-0.0081, -0.0282, -0.0070,  ..., -0.0276, -0.0149, -0.0113],
        [-0.0122,  0.0101,  0.0193,  ...,  0.0002,  0.0143,  0.0121]])
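
As a quick arithmetic check, the averaged values printed above should equal the summed values from the earlier output divided by `num_models = 8`. The sketch below copies the visible entries of both printed tensors. Because the printouts are rounded to four decimals, the comparison uses a tolerance of 1e-4, which is an assumption of this check and not part of the notebook.

```python
num_models = 8

# Visible entries (first three and last three columns) of
# classifier.out_proj.weight after summing the 8 models.
summed = [
    [-0.0649, -0.2258, -0.0561, -0.2208, -0.1192, -0.0907],
    [-0.0979, 0.0809, 0.1546, 0.0016, 0.1148, 0.0969],
]
# The same entries after dividing by num_models.
averaged = [
    [-0.0081, -0.0282, -0.0070, -0.0276, -0.0149, -0.0113],
    [-0.0122, 0.0101, 0.0193, 0.0002, 0.0143, 0.0121],
]

# Largest gap between sum / num_models and the printed average.
max_error = max(
    abs(s / num_models - a)
    for s_row, a_row in zip(summed, averaged)
    for s, a in zip(s_row, a_row)
)
print(max_error < 1e-4)
```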

In [12]:
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

raw_datasets = load_dataset("glue", "mrpc")


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"])


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)



training_args = TrainingArguments(model_save_path+"federated_trainer/", num_train_epochs=5, learning_rate=2e-05,
                                lr_scheduler_type="linear", per_device_train_batch_size=16, seed=42,
                                per_device_eval_batch_size=8, warmup_ratio=0.06)

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


trainer = Trainer(
    federated_model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)
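
To make the schedule settings concrete, the sketch below estimates how many optimizer steps and warmup steps these arguments would imply if `trainer.train()` were called. It assumes a single device, no gradient accumulation, and that the final partial batch is kept. The value 3668 is the size of the MRPC training split, and rounding the warmup count up is an assumption about how `warmup_ratio` is converted to steps.

```python
import math

num_examples = 3668   # size of the MRPC training split
batch_size = 16       # per_device_train_batch_size
num_epochs = 5        # num_train_epochs
warmup_ratio = 0.06

# Assumes one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
```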


In [13]:
# Evaluate the averaged model on the MRPC validation split without any further training.
predictions = trainer.predict(tokenized_datasets["validation"])
preds = np.argmax(predictions.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'accuracy': 0.32598039215686275, 'f1': 0.03508771929824561}

In [14]:
# {'accuracy': 0.6838235294117647, 'f1': 0.8122270742358079}
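
The two metric dictionaries above can be reproduced from confusion-matrix counts, which helps interpret them. The sketch below assumes the MRPC validation split holds 279 paraphrase pairs (label 1) and 129 non-paraphrase pairs (label 0) out of 408. These counts and both confusion matrices are inferred from the reported numbers, not taken from the notebook. Under that assumption, the commented figures match a classifier that predicts label 1 for every pair, while the averaged model's figures match one that predicts label 0 for almost every pair.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1 of the positive class from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1

# Assumed split: 279 positives and 129 negatives (408 pairs in total).
# Case 1: label 1 predicted for every pair.
all_positive = accuracy_and_f1(tp=279, fp=129, fn=0, tn=0)
# Case 2: counts consistent with the averaged model's reported metrics.
averaged_model = accuracy_and_f1(tp=5, fp=1, fn=274, tn=128)

print(all_positive)
print(averaged_model)
```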