### Instructions for Use
1. To load **MA-BERT**, copy the `ma_bert` folder into the `transformers/models/bert` folder of your `transformers` installation.
2. To load **MA-DistilBERT**, copy the `ma_distilbert` folder into the `transformers/models/distilbert` folder of your `transformers` installation.
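
The copy steps above can also be scripted. The following is a minimal sketch, not part of the MA-BERT release: the helper names `find_transformers_models_dir` and `copy_model_folder` are hypothetical, and it assumes the `ma_bert` and `ma_distilbert` folders sit in the current working directory.

```python
import importlib.util
import shutil
from pathlib import Path


def find_transformers_models_dir():
    """Return <transformers>/models for the installed package, or None if not found."""
    spec = importlib.util.find_spec("transformers")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0]) / "models"


def copy_model_folder(src, target_dir):
    """Copy the folder `src` into `target_dir`, keeping the folder name."""
    src, target_dir = Path(src), Path(target_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"source folder not found: {src}")
    if not target_dir.is_dir():
        raise FileNotFoundError(f"target folder not found: {target_dir}")
    dest = target_dir / src.name
    # dirs_exist_ok=True (Python 3.8+) lets the copy be re-run to refresh files.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


# Example usage (assumes ./ma_bert and ./ma_distilbert exist):
# models_dir = find_transformers_models_dir()
# copy_model_folder("ma_bert", models_dir / "bert")
# copy_model_folder("ma_distilbert", models_dir / "distilbert")
```

Resolving the package location through `importlib` avoids hard-coding a `site-packages` path, which differs between virtual environments.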

In [1]:
from transformers.models.bert.ma_bert.modeling_ma_bert import MA_BertForMaskedLM
from transformers.models.bert.ma_bert.configuration_ma_bert import MA_BertConfig
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"



### Loading MA-BERT from Pretrained Checkpoint

In [2]:
ma_bert_config = MA_BertConfig(verbose = True)
ma_bert = MA_BertForMaskedLM(ma_bert_config)

### Modify the checkpoint path if needed
ma_bert_ckpt_path = "ma_bert_mlm_ckpt.pt" ### Assuming checkpoint is stored in the same directory
ma_bert.load_state_dict(torch.load(ma_bert_ckpt_path, map_location = device))

Modifications Applied:
Softmax Approximation: True, Share Softmax: False, Input Size: 128, Hidden Size: 128
Normalization: Power, Warm-up Iterations: 997120, Accumulation Steps: 8
Encoder Activation Function: relu


<All keys matched successfully>
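
The `<All keys matched successfully>` line is the printed return value of `load_state_dict`. If a checkpoint does not match the model exactly, the same return value can be inspected to see which keys differ. The sketch below uses a small `nn.Linear` as a stand-in model (an assumption for illustration; the same call works on any `nn.Module`), and `load_checkpoint_report` is a hypothetical helper name.

```python
import torch
from torch import nn


def load_checkpoint_report(model, state_dict):
    """Load with strict=False and return (missing_keys, unexpected_keys)."""
    # strict=False tolerates missing and unexpected keys instead of raising;
    # tensors whose shapes do not match the model still raise an error.
    result = model.load_state_dict(state_dict, strict=False)
    return list(result.missing_keys), list(result.unexpected_keys)


# Stand-in model and checkpoint, for illustration only.
model = nn.Linear(4, 2)
checkpoint = nn.Linear(4, 2).state_dict()

missing, unexpected = load_checkpoint_report(model, checkpoint)
print("missing:", missing, "unexpected:", unexpected)
```

Two empty lists correspond to the `<All keys matched successfully>` message shown above.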

### Loading MA-BERT (Shared Softmax) from Pretrained Checkpoint

In [3]:
### By default, each encoder layer is assigned its own 2-layer neural network to approximate softmax.
### Set share_softmax_nn = True to share a single 2-layer neural network across all encoder layers.
ma_bert_config_shared_softmax = MA_BertConfig(share_softmax_nn = True, verbose = True)
ma_bert_shared_softmax = MA_BertForMaskedLM(ma_bert_config_shared_softmax)

### Modify the checkpoint path if needed
ma_bert_shared_softmax_ckpt_path = "ma_bert_mlm_ckpt_shared_softmax.pt" ### Assuming checkpoint is stored in the same directory
ma_bert_shared_softmax.load_state_dict(torch.load(ma_bert_shared_softmax_ckpt_path, map_location = device))

Modifications Applied:
Softmax Approximation: True, Share Softmax: True, Input Size: 128, Hidden Size: 128
Normalization: Power, Warm-up Iterations: 997120, Accumulation Steps: 8
Encoder Activation Function: relu


<All keys matched successfully>
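
To illustrate what sharing means in PyTorch terms, the sketch below contrasts one module instance reused by every layer with a fresh instance per layer. It is not the MA-BERT implementation: `SoftmaxApproxMLP`, `build_approximators`, and `count_unique_parameters` are hypothetical names, the Linear-ReLU-Linear layout is an assumption based only on the "2-layer neural network" description, and the 12-layer count is illustrative. The sizes of 128 follow the `Input Size` and `Hidden Size` values printed above.

```python
from torch import nn


class SoftmaxApproxMLP(nn.Module):
    """Hypothetical 2-layer network standing in for the softmax approximator."""

    def __init__(self, input_size=128, hidden_size=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, input_size),
        )

    def forward(self, scores):
        return self.net(scores)


def build_approximators(num_layers, share):
    """Return one approximator per encoder layer, shared or separate."""
    if share:
        shared = SoftmaxApproxMLP()
        return [shared] * num_layers  # every layer holds the same instance
    return [SoftmaxApproxMLP() for _ in range(num_layers)]


def count_unique_parameters(modules):
    """Count parameters, counting each shared tensor only once."""
    seen = {}
    for module in modules:
        for param in module.parameters():
            seen[id(param)] = param.numel()
    return sum(seen.values())


separate = build_approximators(num_layers=12, share=False)
shared = build_approximators(num_layers=12, share=True)
print(count_unique_parameters(separate), count_unique_parameters(shared))
```

Under these assumptions, sharing cuts the approximator parameters by a factor equal to the number of layers, and it also explains why the shared variant needs its own checkpoint file: the two configurations have different sets of state-dict keys.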

### Loading MA-DistilBERT from Pretrained Checkpoint

In [4]:
from transformers.models.distilbert.ma_distilbert.modeling_ma_distilbert import MA_DistilBertForMaskedLM
from transformers.models.distilbert.ma_distilbert.configuration_ma_distilbert import MA_DistilBertConfig

import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [5]:
ma_distilbert_config = MA_DistilBertConfig(verbose = True)
ma_distilbert = MA_DistilBertForMaskedLM(ma_distilbert_config)

### Modify the checkpoint path if needed
ma_distilbert_ckpt_path = "ma_distilbert_mlm_ckpt.pt" ### Assuming checkpoint is stored in the same directory
ma_distilbert.load_state_dict(torch.load(ma_distilbert_ckpt_path, map_location = device))

Modifications Applied:
Softmax Approximation: True, Share Softmax: False, Input Size: 128, Hidden Size: 128
Normalization: Power, Warm-up Iterations: 997120, Accumulation Steps: 8
Encoder Activation Function: relu


<All keys matched successfully>