<a href="https://colab.research.google.com/github/VellummyilumVinoth/Toxic_Comment_Classification/blob/main/finetuned_distilbert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine Tuning DistilBERT for Toxic Comment Classification

In [None]:
import torch

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla T4


In [None]:
import locale
print(locale.getpreferredencoding())

UTF-8


In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import string

train_data = pd.read_csv("/content/drive/MyDrive/Dats/Kaggle/train.csv",encoding = 'latin1')
test_data = pd.read_csv("/content/drive/MyDrive/Dats/Kaggle/test.csv",encoding = 'latin1')

In [None]:
!pip install --upgrade transformers 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


In [None]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn import metrics
import torch
from torch.utils.data import Dataset, DataLoader 
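# DistilBertModel is the encoder wrapped by the custom DistilBERTClass defined below
from transformers import DistilBertTokenizer, DistilBertModel, DistilBertForSequenceClassification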
from transformers import logging

logging.set_verbosity_warning()


## Setting up the device for GPU usage

Next, we prepare the device for CUDA execution. This configuration is needed if you want to use the onboard GPU.

In [None]:
from torch import cuda
device = torch.device('cuda' if cuda.is_available() else 'cpu')

print(f"Current device: {device}")

Current device: cuda


In [None]:
print(f"Total Training Records : {len(train_data)}")
train_data.head()

Total Training Records : 159571


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


## Removing the id column and combining the labels into a single list column

In [None]:
train_data.drop(['id'], inplace=True, axis=1)
train_data['labels'] = train_data.iloc[:, 1:].values.tolist()
train_data.drop(train_data.columns.values[1:-1].tolist(), inplace=True, axis=1)
train_data.head()

Unnamed: 0,comment_text,labels
0,Explanation\nWhy the edits made under my usern...,"[0, 0, 0, 0, 0, 0]"
1,D'aww! He matches this background colour I'm s...,"[0, 0, 0, 0, 0, 0]"
2,"Hey man, I'm really not trying to edit war. It...","[0, 0, 0, 0, 0, 0]"
3,"""\nMore\nI can't make any real suggestions on ...","[0, 0, 0, 0, 0, 0]"
4,"You, sir, are my hero. Any chance you remember...","[0, 0, 0, 0, 0, 0]"
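To make the row-wise packing concrete, here is a minimal sketch in plain Python (no pandas) of what `iloc[:, 1:].values.tolist()` produces for each row. The sample row and the helper name `pack_labels` are made up for illustration and are not part of the notebook.

```python
# Label columns in the order they appear in train.csv.
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def pack_labels(row: dict) -> list:
    """Collect the six 0/1 label fields of one row into a single list."""
    return [int(row[name]) for name in LABEL_COLUMNS]


# Hypothetical row, not taken from the dataset.
sample_row = {
    "comment_text": "some comment",
    "toxic": 1, "severe_toxic": 0, "obscene": 1,
    "threat": 0, "insult": 1, "identity_hate": 0,
}

packed = pack_labels(sample_row)
print(packed)
```

The position of each value in the list is what later ties a model output to a label name, so the column order must stay fixed.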


## Data Cleaning

- Lower case
- Remove extra space

In [None]:
train_data["comment_text"] = train_data["comment_text"].str.lower()
train_data["comment_text"] = train_data["comment_text"].str.replace("\xa0", " ", regex=False).str.split().str.join(" ")
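The two cleaning steps can be checked on a single string. This is a small sketch in plain Python that mirrors the pandas calls above; the function name `clean_comment` and the sample text are assumptions for illustration.

```python
def clean_comment(text: str) -> str:
    """Lowercase, turn non-breaking spaces into spaces, collapse whitespace runs."""
    text = text.lower().replace("\xa0", " ")
    # split() without arguments drops newlines, tabs and repeated spaces
    return " ".join(text.split())


raw = "Explanation\nWhy the  edits\xa0made under my   username"
cleaned = clean_comment(raw)
print(cleaned)
```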

In [None]:
train_data.head()


Unnamed: 0,comment_text,labels
0,explanation why the edits made under my userna...,"[0, 0, 0, 0, 0, 0]"
1,d'aww! he matches this background colour i'm s...,"[0, 0, 0, 0, 0, 0]"
2,"hey man, i'm really not trying to edit war. it...","[0, 0, 0, 0, 0, 0]"
3,""" more i can't make any real suggestions on im...","[0, 0, 0, 0, 0, 0]"
4,"you, sir, are my hero. any chance you remember...","[0, 0, 0, 0, 0, 0]"


# Training Parameters <a id='section03'></a>

We define some key variables that are used later during training.


In [None]:
MAX_LEN = 128
TRAIN_BATCH_SIZE = 32
EPOCHS = 1
LEARNING_RATE = 2e-05
NUM_WORKERS = 2

# Preparing the Dataset and Dataloader <a id='section04'></a>
We start by defining a few key variables that are used later during the training/fine-tuning stage.
We then create the MultiLabelDataset class, which defines how the text is pre-processed before it is sent to the neural network. We also define the Dataloader that feeds the data to the neural network in batches for training and processing.
Dataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to the neural network. For further reading on Dataset and Dataloader, see the [docs at PyTorch](https://pytorch.org/docs/stable/data.html).

## *MultiLabelDataset* Dataset Class
- This class accepts a `dataframe`, a `tokenizer`, a `max_len` and an `eval_mode` flag as input and generates the tokenized output and targets that the DistilBERT model uses for training.
- We use the DistilBERT tokenizer to tokenize the data in the `comment_text` column of the dataframe.
- The tokenizer's `encode_plus` method performs the tokenization and generates the necessary outputs, namely `ids`, `attention_mask` and `token_type_ids`. The model defined in this notebook only consumes the first two.
- To read further into the tokenizer, [refer to this document](https://huggingface.co/transformers/model_doc/distilbert.html#distilberttokenizer)
- `targets` is the list of categories labeled `0` or `1` in the dataframe. With `eval_mode=True` no targets are returned, which is how the unlabeled test data is handled.
- The *MultiLabelDataset* class is used to create 2 datasets: in this notebook, one for training and one for the test data.
- *Training Dataset* is used to fine-tune the model. This notebook passes the full `train.csv` to it; to hold out a validation set (for example 20% of the original data), split the dataframe before building the datasets.
- *Validation Dataset* is used to evaluate the performance of the model on data it has not seen during training. This notebook does not create one.
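The fixed-length output of the tokenizer can be pictured with a small sketch in plain Python. It is not the tokenizer's implementation; it only imitates the padding, truncation and attention-mask behaviour on token ids that are already known. The helper name `pad_or_truncate` is hypothetical, and the special-token ids (101, 102, 0) are the ones visible in the sample output of this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in the uncased vocabulary


def pad_or_truncate(token_ids, max_len):
    """Wrap token ids in [CLS]/[SEP], cut to max_len, pad with 0 and build the mask."""
    body = token_ids[: max_len - 2]      # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_len - len(ids)
    ids += [PAD_ID] * padding
    mask += [0] * padding                # 0 marks padding, which attention ignores
    return ids, mask


# Short input: padded up to max_len.
ids, mask = pad_or_truncate([7526, 2339, 1996], max_len=8)
print(ids)
print(mask)

# Long input: truncated down to max_len.
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_len=8)
print(long_ids)
```

Every item therefore has exactly `max_len` entries, which is what allows the Dataloader to stack items into one tensor per batch.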

## Dataloader
- A Dataloader is created for each dataset and feeds its data to the neural network in a defined manner. This is needed because the whole dataset cannot be loaded into memory at once, so the amount of data loaded into memory and passed to the neural network has to be controlled.
- This control is achieved with parameters such as `batch_size` (set on the Dataloader) and `max_len` (set on the dataset).
- Each dataloader is used in its own part of the flow: the training dataloader during fine-tuning, and the evaluation dataloader during validation or prediction.
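As a quick sanity check on `batch_size`, the number of batches per epoch follows from the row count. The sketch below assumes the 159571 training records reported earlier in this notebook, a batch size of 32, and the DataLoader default of keeping the last, smaller batch (`drop_last=False`); the helper name `num_batches` is made up.

```python
import math


def num_batches(num_rows: int, batch_size: int) -> int:
    """Batches per epoch when the final partial batch is kept."""
    return math.ceil(num_rows / batch_size)


train_batches = num_batches(159571, 32)
print(train_batches)
```

This count should match the final iteration number shown by the training progress bar.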

In [None]:
class MultiLabelDataset(Dataset):

    def __init__(self, dataframe, tokenizer, max_len: int, eval_mode: bool = False):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.text = dataframe.comment_text
        self.eval_mode = eval_mode 
        if self.eval_mode is False:
            self.targets = self.data.labels
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text.iloc[index])
        text = " ".join(text.split())

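        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            # Explicit padding and truncation replace the deprecated
            # pad_to_max_length flag and avoid the tokenizer's truncation warning
            padding='max_length',
            truncation=True,
            return_token_type_ids=True
        )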
        
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        output = {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
        }
                
        if self.eval_mode is False:
            output['targets'] = torch.tensor(self.targets.iloc[index], dtype=torch.float)
                
        return output

## Loading tokenizer and generating training set

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', truncation=True, do_lower_case=True)
training_set = MultiLabelDataset(train_data, tokenizer, MAX_LEN)



## Verify the data at index 0

In [None]:
training_set[0]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


{'ids': tensor([  101,  7526,  2339,  1996, 10086,  2015,  2081,  2104,  2026,  5310,
         18442, 13076, 12392,  2050,  5470,  2020, 16407,  1029,  2027,  4694,
          1005,  1056,  3158,  9305, 22556,  1010,  2074,  8503,  2006,  2070,
          3806,  2044,  1045,  5444,  2012,  2047,  2259, 14421,  6904,  2278,
          1012,  1998,  3531,  2123,  1005,  1056,  6366,  1996, 23561,  2013,
          1996,  2831,  3931,  2144,  1045,  1005,  1049,  3394,  2085,  1012,
          6486,  1012, 16327,  1012,  4229,  1012,  2676,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,  

## Creating Dataloader

In [None]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': NUM_WORKERS
                }
training_loader = DataLoader(training_set, **train_params)

<a id='section05'></a>
# Neural Network for Fine Tuning


In [None]:
# Create the customized model by adding a dense pre-classifier, dropout and a final dense layer on top of DistilBERT to produce the six output logits.

class DistilBERTClass(torch.nn.Module):

    def __init__(self):
        super(DistilBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.1)
        self.classifier = torch.nn.Linear(768, 6)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.Tanh()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output
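As a rough size check, the head added on top of DistilBERT is small. The sketch below counts the trainable parameters of the two `Linear` layers from their declared shapes; the helper name `linear_params` is an assumption for illustration.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


pre_classifier = linear_params(768, 768)   # self.pre_classifier
classifier = linear_params(768, 6)         # self.classifier, one output per label
head_total = pre_classifier + classifier
print(pre_classifier, classifier, head_total)
```

Dropout has no parameters, so these two layers are the only weights that start from random initialization; everything in `l1` starts from the pretrained checkpoint.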


## Loading Neural Network model

In [None]:
model = DistilBERTClass()
model.to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DistilBERTClass(
  (l1): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in

## Loss Function and Optimizer

In [None]:
def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)
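For intuition, the quantity that `BCEWithLogitsLoss` averages can be written out by hand. The following is a sketch in plain Python, not the PyTorch implementation: it applies the usual numerically stable form of binary cross-entropy to raw logits and averages over all label positions. The function name `bce_with_logits` and the example values are illustrative.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all label positions, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# An untrained head outputs logits near 0, so each label costs about ln(2).
loss_at_start = bce_with_logits([0.0] * 6, [0, 0, 0, 0, 0, 0])
print(round(loss_at_start, 4))

# Confident and correct logits give a loss close to 0.
loss_confident = bce_with_logits([8.0, -8.0], [1, 0])
print(loss_confident)
```

Because each of the six labels gets its own sigmoid, a comment can be positive for several labels at once, which is why this loss is used here instead of a softmax cross-entropy.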

In [None]:
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

<a id='section06'></a>
# Fine Tuning the Model

In [None]:
def train(epoch):
    
    model.train()
    for step, data in tqdm(enumerate(training_loader, 0)):
        # Move the batch to the selected device
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.float)

        # Forward pass: raw logits for the six labels
        outputs = model(ids, mask, token_type_ids)

        optimizer.zero_grad()
        loss = loss_fn(outputs, targets)
        # Log the loss of the current batch every 100 steps
        if step % 100 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        # Backward pass and parameter update
        loss.backward()
        optimizer.step()

In [None]:
for epoch in range(EPOCHS):
    train(epoch)

0it [00:00, ?it/s]

Epoch: 0, Loss:  0.6862832307815552


100it [00:35,  2.90it/s]

Epoch: 0, Loss:  0.09722387790679932


200it [01:09,  3.00it/s]

Epoch: 0, Loss:  0.16483023762702942


300it [01:43,  2.89it/s]

Epoch: 0, Loss:  0.053339049220085144


401it [02:17,  2.96it/s]

Epoch: 0, Loss:  0.06387589871883392


501it [02:51,  2.96it/s]

Epoch: 0, Loss:  0.03283071145415306


600it [03:25,  2.95it/s]

Epoch: 0, Loss:  0.09394600242376328


700it [03:59,  2.93it/s]

Epoch: 0, Loss:  0.04696191847324371


800it [04:33,  2.92it/s]

Epoch: 0, Loss:  0.012540765106678009


900it [05:07,  2.92it/s]

Epoch: 0, Loss:  0.02697247825562954


1001it [05:42,  2.94it/s]

Epoch: 0, Loss:  0.05560244247317314


1100it [06:15,  2.93it/s]

Epoch: 0, Loss:  0.0746593028306961


1200it [06:50,  2.93it/s]

Epoch: 0, Loss:  0.018051888793706894


1300it [07:24,  2.92it/s]

Epoch: 0, Loss:  0.02085130475461483


1400it [07:58,  2.93it/s]

Epoch: 0, Loss:  0.007233957760035992


1500it [08:32,  2.96it/s]

Epoch: 0, Loss:  0.025788171216845512


1601it [09:06,  2.95it/s]

Epoch: 0, Loss:  0.10573657602071762


1700it [09:40,  2.94it/s]

Epoch: 0, Loss:  0.0336749404668808


1800it [10:14,  2.94it/s]

Epoch: 0, Loss:  0.05498596653342247


1900it [10:48,  2.93it/s]

Epoch: 0, Loss:  0.018519870936870575


2000it [11:22,  2.94it/s]

Epoch: 0, Loss:  0.055615074932575226


2100it [11:56,  2.93it/s]

Epoch: 0, Loss:  0.04409198462963104


2200it [12:30,  2.93it/s]

Epoch: 0, Loss:  0.026907093822956085


2300it [13:04,  2.94it/s]

Epoch: 0, Loss:  0.006467844825237989


2400it [13:38,  2.96it/s]

Epoch: 0, Loss:  0.024246608838438988


2500it [14:12,  2.95it/s]

Epoch: 0, Loss:  0.07031972706317902


2600it [14:46,  2.96it/s]

Epoch: 0, Loss:  0.018620654940605164


2700it [15:20,  2.95it/s]

Epoch: 0, Loss:  0.03471093624830246


2800it [15:54,  2.92it/s]

Epoch: 0, Loss:  0.03168874979019165


2901it [16:28,  2.96it/s]

Epoch: 0, Loss:  0.05376681685447693


3000it [17:02,  2.94it/s]

Epoch: 0, Loss:  0.03281959891319275


3100it [17:36,  2.94it/s]

Epoch: 0, Loss:  0.03481123223900795


3200it [18:10,  2.91it/s]

Epoch: 0, Loss:  0.04180414602160454


3300it [18:44,  2.91it/s]

Epoch: 0, Loss:  0.056864798069000244


3400it [19:18,  2.94it/s]

Epoch: 0, Loss:  0.011437738314270973


3500it [19:52,  2.94it/s]

Epoch: 0, Loss:  0.020729854702949524


3600it [20:27,  2.94it/s]

Epoch: 0, Loss:  0.011952164582908154


3700it [21:01,  2.95it/s]

Epoch: 0, Loss:  0.061880551278591156


3800it [21:35,  2.93it/s]

Epoch: 0, Loss:  0.032920800149440765


3900it [22:09,  2.94it/s]

Epoch: 0, Loss:  0.02960791066288948


4000it [22:43,  2.93it/s]

Epoch: 0, Loss:  0.026404019445180893


4100it [23:17,  2.95it/s]

Epoch: 0, Loss:  0.03366443142294884


4200it [23:51,  2.92it/s]

Epoch: 0, Loss:  0.039142679423093796


4301it [24:26,  2.94it/s]

Epoch: 0, Loss:  0.027132278308272362


4400it [24:59,  2.95it/s]

Epoch: 0, Loss:  0.04447703808546066


4500it [25:33,  2.93it/s]

Epoch: 0, Loss:  0.04322236776351929


4600it [26:08,  2.92it/s]

Epoch: 0, Loss:  0.07052722573280334


4700it [26:42,  2.94it/s]

Epoch: 0, Loss:  0.015128618106245995


4800it [27:16,  2.95it/s]

Epoch: 0, Loss:  0.06700494885444641


4900it [27:50,  2.93it/s]

Epoch: 0, Loss:  0.011773976497352123


4987it [28:19,  2.93it/s]


# Generate Submissions.csv <a id='section07'></a>

In [None]:
test_data.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [None]:
# test.csv already uses the `comment_text` column name, so no renaming is needed;
# keep only the columns required for prediction and submission.
columns_to_keep = ['id', 'comment_text']
test_data = test_data[columns_to_keep]

In [None]:
test_data.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [None]:
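test_set = MultiLabelDataset(test_data, tokenizer, MAX_LEN, eval_mode = True)
# Predictions are written back to test_data by row position,
# so the test loader must keep the original row order (no shuffling).
testing_params = {'batch_size': TRAIN_BATCH_SIZE,
               'shuffle': False,
               'num_workers': 2
                }
test_loader = DataLoader(test_set, **testing_params)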
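Predictions are later attached to `test_data` by position, so the loader has to yield rows in their original order. The toy sketch below, in plain Python with a made-up lookup table standing in for the model, shows how any reordering (a reversed order is used here in place of a random shuffle) assigns predictions to the wrong ids.

```python
ids = ["a", "b", "c", "d", "e", "f"]
# Stand-in "model": the score each row would receive.
scores = {"a": 0.0, "b": 1.0, "c": 0.0, "d": 1.0, "e": 0.0, "f": 1.0}


def predict_in_batches(order, batch_size=2):
    """Score rows batch by batch, in the order the loader yields them."""
    preds = []
    for start in range(0, len(order), batch_size):
        preds.extend(scores[row_id] for row_id in order[start:start + batch_size])
    return preds


# Original order: prediction i belongs to id i.
in_order = predict_in_batches(ids)
aligned = all(scores[i] == p for i, p in zip(ids, in_order))

# Reordered loader: the predictions no longer line up with the id column.
out_of_order = predict_in_batches(ids[::-1])
misaligned = sum(scores[i] != p for i, p in zip(ids, out_of_order))
print(aligned, misaligned)
```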

In [None]:
all_test_pred = []

def test(model):
    """Run the model over test_loader and collect rounded 0/1 predictions in all_test_pred."""
    model.eval()
    
    with torch.inference_mode():
    
        for _, data in tqdm(enumerate(test_loader, 0)):

            ids = data['ids'].to(device, dtype=torch.long)
            mask = data['mask'].to(device, dtype=torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
            outputs = model(ids, mask, token_type_ids)
            probas = torch.sigmoid(outputs)  # Per-label probabilities

            rounded_probas = torch.round(probas)  # Round probabilities to 0 or 1

            all_test_pred.append(rounded_probas)

    # Only the probabilities of the last batch are returned;
    # the predictions for the whole test set are in all_test_pred.
    return probas

In [None]:
probas = test(model)

4787it [09:28,  8.42it/s]
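The step from logits to 0/1 labels can be followed on a single row. This is a plain-Python sketch of the sigmoid-then-round logic used in `test`; the logits are invented for illustration and `to_labels` is a hypothetical helper.

```python
import math


def sigmoid(z: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def to_labels(logits):
    """Turn one row of logits into 0/1 predictions with a 0.5 cut-off."""
    return [1.0 if sigmoid(z) > 0.5 else 0.0 for z in logits]


# Hypothetical logits for the six labels of one comment.
row_logits = [2.3, -4.1, 0.7, -6.0, 1.1, -3.2]
labels = to_labels(row_logits)
print(labels)
```

Since the sigmoid of a positive logit is above 0.5, the rounding is equivalent to checking the sign of each logit; each label is decided independently of the others.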


In [None]:
all_test_pred = torch.cat(all_test_pred)

In [None]:
submit_df = test_data.copy()

In [None]:
label_columns = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

In [None]:
for i, name in enumerate(label_columns):
    # Move the predictions to the CPU and convert them to NumPy before assigning the column
    submit_df[name] = all_test_pred[:, i].cpu().numpy()

submit_df.head()

In [None]:
submit_df.to_csv('submission.csv', index=False)

In [None]:
submit_df

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,0.0,0.0,0.0,0.0,0.0,0.0
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...,1.0,0.0,0.0,0.0,0.0,0.0
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap...",0.0,0.0,0.0,0.0,0.0,0.0
3,00017563c3f7919a,":If you have a look back at the source, the in...",0.0,0.0,0.0,0.0,0.0,0.0
4,00017695ad8997eb,I don't anonymously edit articles at all.,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...
153159,fffcd0960ee309b5,". \n i totally agree, this stuff is nothing bu...",1.0,0.0,0.0,0.0,1.0,0.0
153160,fffd7a9a6eb32c16,== Throw from out field to home plate. == \n\n...,0.0,0.0,0.0,0.0,0.0,0.0
153161,fffda9e8d6fafa9e,""" \n\n == Okinotorishima categories == \n\n I ...",0.0,0.0,0.0,0.0,0.0,0.0
153162,fffe8f1340a79fc2,""" \n\n == """"One of the founding nations of the...",0.0,0.0,0.0,0.0,0.0,0.0




In [None]:
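# Reuse the fine-tuned `model` from above. Calling DistilBERTClass() again here would
# reload the original pretrained weights and discard the fine-tuning.
# Note: this saves only the DistilBERT encoder (`l1`), not the classification head.
model.l1.save_pretrained("/content/drive/MyDrive/finetuned_model")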




In [None]:
import os

output_dir = '/content/drive/MyDrive/finetuned_distilbert'

# Create the target folder if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

tokenizer.save_pretrained(output_dir)
print('Saved')

Saved
